Segmented Data Ingestion

IP.com Disclosure Number: IPCOM000237025D
Publication Date: 2014-May-27
Document File: 3 page(s) / 23K

Publishing Venue

The IP.com Prior Art Database

Abstract

A system and method for segmented data ingestion is disclosed.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 52% of the total text.

Disclosed is a system and method for segmented data ingestion.

Scenario

The Loader ingests the First Set of Media.

The Analyzer obtains the content, content-type and context for the loaded media.

The System generates a model to analyze the pre-loaded content.

The Analyzer checks the regression statistics.

The Analyzer identifies the optimal media.

The Loader loads the optimal media.

Steps

The Analyzer obtains the details of the media from the Loader:

1. Content - a single media file
2. Content Segments - the split media file, where the split is at a chapter, key frame, paragraph, embedded content, etc.
3. Content-Type - e.g. text/html, image/png, video/mp4
4. Context - the bibliography, comments, tags, footnotes, endnotes, etc.

The System creates a contextual content model for the media set's details.
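As a minimal sketch, the details obtained from the Loader could be held in a small record type. The names `MediaDetails`, `split_paragraphs` and `describe` are hypothetical, and splitting text at blank lines is only one assumed way of producing content segments; the disclosure does not prescribe a representation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MediaDetails:
    """Hypothetical record of what the Analyzer obtains for one media file."""
    content: str                                   # a single media file (text, in this sketch)
    content_type: str                              # e.g. "text/html"
    segments: List[str] = field(default_factory=list)             # the split media file
    context: Dict[str, List[str]] = field(default_factory=dict)   # tags, comments, footnotes, ...


def split_paragraphs(text: str) -> List[str]:
    """Assumed segmentation rule: split text content at blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def describe(content: str, content_type: str,
             context: Dict[str, List[str]]) -> MediaDetails:
    """Build the details record for one media file."""
    return MediaDetails(content, content_type, split_paragraphs(content), context)
```

A set of such records, one per loaded file, would then be the input to whatever contextual content model the System builds.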

The System generates regression statistics using any supported data mining models (clustering, k-means, nearest-neighbor).

The Analyzer reviews the statistics to determine:

1. Outlier Contents and Content-Types
2. Outlier Confidence

The Analyzer filters out the low-confidence content and content-types from the media. The Analyzer identifies the attributes of the high-confidence outlier content and content-types from the first set.
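One way to picture the outlier review is a nearest-neighbor distance score. The sketch below assumes each media item has already been reduced to a numeric feature vector. It treats the mean distance to the k nearest neighbors as the outlier score and the score rescaled to [0, 1] as the outlier confidence. These definitions, the function names and the default cut-off of 0.8 are illustrative assumptions, not the method of the disclosure.

```python
import math
from typing import Dict, List, Sequence


def knn_outlier_scores(points: Dict[str, Sequence[float]], k: int = 2) -> Dict[str, float]:
    """Score each item by its mean distance to its k nearest neighbours."""
    scores = {}
    for name, p in points.items():
        dists = sorted(math.dist(p, q) for other, q in points.items() if other != name)
        nearest = dists[:k]
        # An item with no neighbours cannot be compared, so it scores 0.
        scores[name] = sum(nearest) / len(nearest) if nearest else 0.0
    return scores


def outlier_confidence(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; the largest score maps to 1.0 (assumed definition)."""
    top = max(scores.values(), default=0.0)
    return {name: (s / top if top > 0 else 0.0) for name, s in scores.items()}


def high_confidence_outliers(points: Dict[str, Sequence[float]],
                             k: int = 2, min_confidence: float = 0.8) -> List[str]:
    """Keep only the items whose outlier confidence reaches the cut-off."""
    conf = outlier_confidence(knn_outlier_scores(points, k))
    return sorted(name for name, c in conf.items() if c >= min_confidence)
```

For example, with three items clustered near the origin and one far away, only the distant item survives the filter:

```python
items = {"a": (0, 0), "b": (0, 1), "c": (1, 0), "d": (10, 10)}
high_confidence_outliers(items)  # ["d"]
```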

The Analyzer determines whether a second set is needed based on a threshold. The threshold may be predetermined or determined just in time.

The Loader loads the second set of media based on the model.
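The threshold decision could look like the following sketch. It assumes the Analyzer holds an outlier confidence per item and that a second set is requested when any confidence exceeds the threshold. When no threshold is supplied, one is derived just in time from the observed values (mean plus one population standard deviation). Both the decision rule and the derived threshold are assumptions made for illustration; the disclosure does not specify them.

```python
from statistics import mean, pstdev
from typing import Mapping, Optional


def needs_second_set(confidence: Mapping[str, float],
                     threshold: Optional[float] = None) -> bool:
    """Decide whether a second set of media should be loaded."""
    if not confidence:
        return False
    values = list(confidence.values())
    if threshold is None:
        # Just-in-time threshold, computed from the first set itself (assumed rule).
        threshold = mean(values) + pstdev(values)
    return any(v > threshold for v in values)
```

With a predetermined threshold of 0.8, `needs_second_set({"a": 0.1, "d": 0.9}, threshold=0.8)` returns `True`, and the Loader would then fetch media matching the attributes of item `d`.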

The system extracts the important structural and summary elements of the media:

Abstract -
  Extract segments titled section Abstract - use XPath or Application Programming Interface (API) access to Section Content
  Extract the first few segments of text - use the head of the Main Body
  Extract document properties - use the API to access the document properties

Citations -
  Extract Citations from Medical or APA® style - use Regular Expressions
  Extract Citations using any format - use XPath or Format styles

Index -
  Extract Index Entries - use XPath or API Access to Section Content
  Extract the final segments of text - use the tail of the Main Body

Table of Contents -
  Extract the Table of Contents - use XPath or API Access to Table of Contents Entries

Footnotes and Endnotes -
  Extract the Footnotes - use XPath or API Access

Links - The system extracts document mapping of internal links.
  Extracts the Links / HREF - use XPath or API Access

Internal References - any other style may use Regular Expressions to find relevant materials
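The XPath and regular-expression extraction described above can be sketched with the Python standard library. The XML layout in `SAMPLE`, its text and its citations are made up for illustration, and `APA_CITATION` is a deliberately simplified pattern that only recognizes parenthetical citations of the form "(Author, Year)" or "(Author & Author, Year)". A real document format would need its own paths and a fuller citation grammar.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical document layout; real formats will differ.
SAMPLE = """<document>
  <section title="Abstract"><p>Segmented ingestion loads media in stages.</p></section>
  <section title="Main Body">
    <p>Earlier work (Smith, 2010) loaded all media at once.</p>
    <p>See <a href="#index">the index</a> and (Lee &amp; Park, 2012).</p>
  </section>
</document>"""

# Simplified APA-style parenthetical citation: "(Smith, 2010)" or "(Lee & Park, 2012)".
APA_CITATION = re.compile(r"\(([A-Z][A-Za-z]+(?: (?:&|and) [A-Z][A-Za-z]+)?), (\d{4})\)")


def extract_elements(xml_text: str) -> dict:
    """Pull the abstract, link targets and citations out of one document."""
    root = ET.fromstring(xml_text)
    # XPath: sections whose title attribute is "Abstract".
    abstract = ["".join(s.itertext()).strip()
                for s in root.findall(".//section[@title='Abstract']")]
    # XPath: every anchor that carries an href attribute.
    links = [a.get("href") for a in root.findall(".//a[@href]")]
    # Regular expression over the flattened main-body text.
    body = " ".join("".join(p.itertext())
                    for p in root.findall(".//section[@title='Main Body']/p"))
    citations = APA_CITATION.findall(body)
    return {"abstract": abstract, "links": links, "citations": citations}
```

`xml.etree.ElementTree` supports only a subset of XPath, which is enough for attribute predicates like these; fuller XPath expressions would need a different library.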

The abstract is the only initial and required node for the system.

The system stores the relevant text around the structural elements, and relations to the other elements.
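A possible shape for this stored structure is a small graph whose only mandatory node is the abstract. In the sketch below, the class names, the node identifiers and the fixed-width `context_window` helper are hypothetical; the disclosure does not say how much surrounding text is kept or how relations are encoded.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class Node:
    kind: str                     # e.g. "abstract", "citation", "footnote"
    text: str                     # the element's own text
    surrounding: str = ""         # relevant text around the element
    related: List[str] = field(default_factory=list)  # ids of related nodes


def context_window(body: str, start: int, end: int, width: int = 40) -> str:
    """Return the element at body[start:end] plus `width` characters on each side."""
    return body[max(0, start - width):min(len(body), end + width)]


class DocumentStructure:
    """Graph of structural elements; the abstract node is required at creation."""

    def __init__(self, abstract_text: str):
        self.nodes: Dict[str, Node] = {"abstract": Node("abstract", abstract_text)}

    def add(self, node_id: str, node: Node, related_to: Iterable[str] = ()) -> None:
        related_to = list(related_to)
        missing = [other for other in related_to if other not in self.nodes]
        if missing:
            raise KeyError(f"unknown related nodes: {missing}")
        self.nodes[node_id] = node
        for other in related_to:
            # Record the relation in both directions.
            node.related.append(other)
            self.nodes[other].related.append(node_id)
```

Every other element is optional and is attached later, together with its surrounding text and links to the nodes it relates to.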

The system may load segments of an underlying file format in order to generate the document structure.

Example for Open Document...