
System and Method for Corpus Content Health Check for Information Retrieval Systems
Disclosure Number: IPCOM000247769D
Publication Date: 2016-Oct-06
Document File: 1 page(s) / 36K

Publishing Venue

The Prior Art Database


No existing method reads and extracts the various data elements in a document accurately, because current methods cannot reliably interpret and convert document formats such as PDF, PPT, Word, and XLS. These methods extract data elements incorrectly and introduce noise at various levels. To ingest a corpus properly and obtain meaningful information from it, the quality of each document needs to be assessed. Assessing document quality for ingestion is currently a manual process, and the time required varies with the size of the project. The manual process is error-prone for many reasons (for example, images or tables on a page get overlooked). Moreover, when there are hundreds of documents, the effort of assessing quality manually is huge and causes considerable delays in project implementation. A sizeable part of implementation time is consumed by the manual process and by the random issues that arise from the lack of a standard process. This article provides a system and method to perform a content health check using machine learning techniques.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 59% of the total text.


The proposed approach performs a corpus quality health check based on the various data elements present in the corpus.

No single existing solution assesses overall quality efficiently using a generalized learning-based methodology. Existing rule-based systems often fail to assess the various document elements correctly because they cannot handle the variety of cases that arise. The proposed solution uses a generalized learning-based methodology that automatically learns weights for the various document elements and then generates an overall score. Document fitness is determined by comparing that score against a threshold.
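The scoring and threshold step can be illustrated with a minimal Python sketch. It assumes that each element type already has an extraction accuracy between 0 and 1 and that weights are already available. The element names, the weight values, the 0.8 threshold, and the function names overall_score and is_fit are all hypothetical and are not taken from the disclosure.

```python
# Hypothetical per-element weights (illustration only; in the proposed
# approach these would be learned rather than set by hand).
ELEMENT_WEIGHTS = {
    "segments": 0.30,
    "titles": 0.15,
    "tables": 0.25,
    "charts": 0.15,
    "images": 0.15,
}


def overall_score(accuracies, weights=ELEMENT_WEIGHTS):
    """Return the weighted average of per-element extraction accuracies.

    Only elements that appear in both `accuracies` and `weights` are used,
    so a document without charts is not penalised for the missing element.
    """
    known = [name for name in accuracies if name in weights]
    total_weight = sum(weights[name] for name in known)
    if total_weight == 0:
        raise ValueError("no scored element has a weight")
    weighted_sum = sum(weights[name] * accuracies[name] for name in known)
    return weighted_sum / total_weight


def is_fit(accuracies, threshold=0.8, weights=ELEMENT_WEIGHTS):
    """Decide document fitness by comparing the score with a threshold."""
    return overall_score(accuracies, weights) >= threshold


if __name__ == "__main__":
    doc = {"segments": 0.95, "titles": 0.90, "tables": 0.60, "images": 0.85}
    print(round(overall_score(doc), 3), is_fit(doc))
```

Normalising by the total weight of the elements actually present is a design choice of this sketch, made so that documents with different element mixes remain comparable.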

Summary of the approach:

Document fitness will be identified based on the extraction accuracy of various elements such as segments, titles, tables, charts, and images. Extracting these elements generates noise, which has to be identified and cleaned. Overall document quality will be determined from the accuracy of the various elements. A method based on an unsupervised regression technique will be used to deduce the document's overall quality.
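To make the idea of learning element weights by regression concrete, here is a minimal Python sketch. It is not the disclosure's method: the text names an unsupervised regression technique without specifying it, so the sketch swaps in ordinary least-squares regression fitted by gradient descent, and it assumes a small set of documents with reference quality scores. The data, the function names fit_weights and predict_quality, and the learning-rate and epoch settings are all hypothetical.

```python
def predict_quality(weights, accuracies):
    """Overall quality as a weighted sum of per-element accuracies."""
    return sum(w * a for w, a in zip(weights, accuracies))


def fit_weights(features, targets, lr=0.1, epochs=2000):
    """Fit element weights by minimising mean squared error.

    features: one row per document, one accuracy in [0, 1] per element type.
    targets:  one reference quality score per document (assumed available).
    """
    n_docs = len(features)
    weights = [0.0] * len(features[0])
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for row, target in zip(features, targets):
            error = predict_quality(weights, row) - target
            for j, value in enumerate(row):
                grads[j] += 2.0 * error * value / n_docs
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


# Hypothetical training data: columns are accuracies for
# (segments, tables, images); targets are made-up reference scores.
FEATURES = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
]
TARGETS = [0.5, 0.3, 0.2, 1.0, 0.4, 0.25]

if __name__ == "__main__":
    learned = fit_weights(FEATURES, TARGETS)
    print([round(w, 3) for w in learned])
```

The learned weights can then be combined with a score threshold to decide whether a document is fit for ingestion.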

This is a machine learning based methodology that learns qualitative weights for the extracted document elements in order to judge overall document quality using a regression technique. We assume that various elements such as sections, paragraphs, titles, tables, charts, figures, images, and noisy segments are extracted from each document. No single...