Segmented Data Ingestion
Publication Date: 2014-May-27
The IP.com Prior Art Database
A system and method for segmented data ingestion is disclosed.
Disclosed is a system and method for segmented data ingestion.
The Loader ingests the First Set of Media.
The Analyzer obtains the content, content-type and context for the loaded media.
The System generates a model to analyze the pre-loaded content.
The Analyzer checks the regression statistics.
The Analyzer identifies the optimal media.
The Loader loads the optimal media.
The Analyzer obtains the details of the media from the Loader:
Content - a single media file
Content Segments - the split media file, where the split is at a chapter, key frame, paragraph, embedded content, etc.
Content-Type - e.g., text/html, image/png, video/mp4
Context - the bibliography, comments, tags, footnotes, endnotes, etc.
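The details listed above can be held in a small record. The following is a minimal sketch, not the disclosed implementation: the class name MediaDetails, its field names, and the helper split_paragraphs are hypothetical, and the content is assumed to be text that has already been decoded. Splitting at blank lines stands in for one possible segment boundary (the paragraph).

```python
from dataclasses import dataclass, field


@dataclass
class MediaDetails:
    """Details the Analyzer obtains for one media file (names are illustrative)."""
    content: str                                   # a single media file, decoded to text
    content_type: str                              # e.g. "text/html"
    context: dict = field(default_factory=dict)    # tags, comments, footnotes, ...
    segments: list = field(default_factory=list)   # the split media file


def split_paragraphs(details: MediaDetails) -> MediaDetails:
    """Split text content at blank lines, one possible segment boundary."""
    blocks = details.content.split("\n\n")
    details.segments = [b.strip() for b in blocks if b.strip()]
    return details


doc = split_paragraphs(MediaDetails(
    content="Abstract text.\n\nFirst paragraph.\n\nSecond paragraph.",
    content_type="text/plain",
    context={"tags": ["ingestion"]},
))
print(doc.segments)
```

Other boundaries named in the disclosure (chapter, key frame, embedded content) would need a format-specific splitter in place of split_paragraphs.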
The System creates a contextual content model for the media set's details.
The System generates regression statistics using any supported data mining models (clustering, k-means, nearest-neighbor).
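As an illustration of one of the named model families, the sketch below scores each media item by the distance to its nearest neighbor, so that isolated items stand out. It is an assumption-laden example: the feature vectors (number of segments, mean segment length) and the function name nearest_neighbor_distances are hypothetical and do not come from the disclosure.

```python
import math


def nearest_neighbor_distances(features):
    """Return, for each item, the Euclidean distance to its closest other item.

    Requires at least two items.
    """
    distances = {}
    for name, vec in features.items():
        distances[name] = min(
            math.dist(vec, other)
            for other_name, other in features.items()
            if other_name != name
        )
    return distances


# Hypothetical features: (number of segments, mean segment length in words)
features = {
    "a.html": (10.0, 120.0),
    "b.html": (12.0, 118.0),
    "c.html": (11.0, 125.0),
    "d.mp4": (90.0, 5.0),
}
scores = nearest_neighbor_distances(features)
outlier = max(scores, key=scores.get)  # the item farthest from all others
print(outlier)
```

A clustering or k-means model could be substituted here; the nearest-neighbor distance is used only because it fits in a few lines.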
The Analyzer reviews the statistics to determine:
Outlier Contents and Content-Types
The Analyzer filters out the low-confidence content and content-types from the media.
The Analyzer identifies the attributes of the high-confidence outlier content and content-types from the first set.
The Analyzer determines whether a second set is needed based on a threshold. The threshold may be predetermined or determined just in time.
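The filtering and threshold steps might look like the sketch below. The disclosure does not say what the threshold measures, so this example assumes it is the fraction of media retained after filtering. The function names, the confidence scores, and the use of the mean as a just-in-time cutoff are all hypothetical.

```python
import statistics


def filter_by_confidence(scored, min_confidence):
    """Keep only items whose confidence is at or above min_confidence."""
    return {name: c for name, c in scored.items() if c >= min_confidence}


def needs_second_set(kept, total, threshold):
    """Request a second set when the retained fraction falls below threshold."""
    if total == 0:
        return True
    return len(kept) / total < threshold


# Hypothetical confidence scores for the first set of media
scored = {"a.html": 0.92, "b.html": 0.35, "c.png": 0.81, "d.mp4": 0.10}

# Predetermined cutoff
kept = filter_by_confidence(scored, min_confidence=0.5)

# Just-in-time cutoff, here assumed to be the mean of the observed scores
jit_cutoff = statistics.mean(scored.values())
kept_jit = filter_by_confidence(scored, min_confidence=jit_cutoff)

load_more = needs_second_set(kept, total=len(scored), threshold=0.75)
print(sorted(kept), load_more)
```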
The Loader loads the second set of media based on the model.
The system extracts the important structural and summary elements of the media:
Extract the section titled Abstract - use XPath or Application Programming Interface (API) access to Section Content
Extract the first few segments of text - use the head of the Main Body
Extract document properties - use the API to access the document properties
Citations -
Extract Citations from Medical or APA Style® - use Regular Expressions
Extract Citations using any format - use XPath or Format styles
Extract Index Entries - use XPath or API Access to Section Content
Extract the final segments of text - use the tail of the Main Body
Table of Contents -
Extract the Table of Contents - use XPath or API Access to Table of Contents Entries
Footnotes and Endnotes -
Extract the Footnotes and Endnotes - use XPath or API Access
Links - The system extracts a mapping of the document's internal links.
Extract the Links / HREF - use XPath or API Access
Internal References - any other style may use Regular Expressions to find relevant materials
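Several of the extraction steps above can be sketched with the Python standard library. This is an illustrative example only: the sample markup, its element and attribute names, and the citation pattern (author-year in parentheses, a simplification of APA in-text citations) are assumptions, and xml.etree.ElementTree supports only a subset of XPath.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical document markup, used only for this sketch
SAMPLE = """<document>
  <section title="Abstract"><p>Segmented ingestion loads media in stages.</p></section>
  <section title="Body">
    <p>Earlier work (Smith, 2010) loaded everything at once.</p>
    <p>Staged loading (Jones et al., 2012) is described <a href="#toc">here</a>.</p>
  </section>
</document>"""

# Simplified author-year pattern, e.g. "(Smith, 2010)" or "(Jones et al., 2012)"
CITATION = re.compile(r"\(([A-Z][A-Za-z'-]+(?: et al\.)?), (\d{4})\)")

root = ET.fromstring(SAMPLE)

# Section titled Abstract, located with an XPath attribute predicate
abstract = root.find(".//section[@title='Abstract']")
abstract_text = "".join(abstract.itertext()).strip()

# Head and tail of the main body
paragraphs = ["".join(p.itertext()) for p in root.iterfind(".//section[@title='Body']/p")]
head, tail = paragraphs[0], paragraphs[-1]

# Links / HREF, located with XPath
links = [a.get("href") for a in root.iterfind(".//a[@href]")]

# Citations, located with a regular expression over the full text
citations = CITATION.findall("".join(root.itertext()))

print(abstract_text, links, citations)
```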
The abstract is the only initial and required node for the system.
The system stores the relevant text around the structural elements, and relations to the other elements.
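One possible shape for such a stored record is sketched below. The record layout, the function name surrounding_text, the fixed character window, and the "related_to" field are hypothetical; the disclosure does not specify how the surrounding text or the relations are represented.

```python
import re


def surrounding_text(text, start, end, window=10):
    """Return the matched span plus up to `window` characters on each side."""
    lo = max(0, start - window)
    hi = min(len(text), end + window)
    return text[lo:hi]


text = "Staged loading is faster (Jones, 2012) than bulk loading."
match = re.search(r"\(\w+, \d{4}\)", text)

record = {
    "element": "citation",
    "value": match.group(),
    "context": surrounding_text(text, match.start(), match.end()),
    "related_to": ["section:Body"],  # hypothetical link to the enclosing element
}
print(record)
```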
The system may load segments of an underlying file format in order to generate the document structure.
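For a ZIP-based package format, loading a segment can mean reading a single member of the archive and leaving the rest untouched. The sketch below assumes a package whose body is stored in a member named content.xml and whose headings use the OpenDocument text namespace. The function name load_headings and the in-memory sample package are hypothetical and are built here only to keep the example self-contained.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def load_headings(package, member="content.xml"):
    """Read one member of a ZIP package and return its heading texts."""
    with zipfile.ZipFile(package) as zf:
        with zf.open(member) as fh:
            root = ET.parse(fh).getroot()
    return ["".join(h.itertext()) for h in root.iter(f"{{{TEXT_NS}}}h")]


# Build a tiny in-memory package for the demonstration
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(
        "content.xml",
        f'<doc xmlns:text="{TEXT_NS}">'
        "<text:h>Abstract</text:h><text:p>Body</text:p><text:h>Index</text:h>"
        "</doc>",
    )
    zf.writestr("Pictures/large.bin", b"\0" * 1024)  # never read by load_headings
buffer.seek(0)

headings = load_headings(buffer)
print(headings)
```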
Example for Open Document...