Boilerplate Text Detection in Clinical Documentation

IP.com Disclosure Number: IPCOM000245410D
Publication Date: 2016-Mar-08
Document File: 4 page(s) / 219K

Publishing Venue

The IP.com Prior Art Database

Abstract

Text designated as "boilerplate" (i.e., passages that are repeated without change across clinical documents, such as disclaimers and instructions) often needs to be treated differently from the main body of text by natural language processing applications. The described approach to detecting boilerplate text combines two complementary methods: 1) a semi-supervised method that detects boilerplate text in a corpus of interest based on a training set of known boilerplate text, and 2) a method based on locality-sensitive hashing that detects boilerplate text based on the patterns of substring repetition in the corpus of interest.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 31% of the total text.

Introduction

Clinical documents describing patient encounters often include "boilerplate" information such as passages of text that are repeated verbatim (e.g., instructions or warnings). Boilerplate text should be identified during processing of the documents by computer-assisted coding (CAC) and other natural language processing (NLP) software, because treating it similarly to other text may lead to processing errors. Therefore, there is a need for software that can identify boilerplate text automatically. The precise definition of boilerplate text is application-dependent. One may pragmatically define boilerplate text as any passages of text that repeat from one document to another with little or no change, and which should be ignored or otherwise treated differently from other text by a given NLP application.

Automatic boilerplate identification has several characteristics that motivate the choice of the proposed methods. Boilerplate can be identified with rule-based methods: because boilerplate is highly repetitive, incoming documents are normally matched against snippets that analysts have previously identified as boilerplate. Boilerplate identified this way has high precision, so it can be used to create a training corpus of documents for statistical modeling. However, because clinical document templates vary considerably and analysts do not have the time to inspect them all, some boilerplate will usually remain unidentified. Statistical methods for identifying boilerplate must therefore cope with training corpora that contain a potentially substantial proportion of false negatives, that is, boilerplate passages that are not labeled as such. We address this problem with semi-supervised learning methods.

Furthermore, some boilerplate from new customers may differ enough from any boilerplate previously identified by the analysts that a model trained on historical data fails to identify it. While some boilerplate for a new customer may be identified during customer implementation, additional boilerplate patterns are likely to appear afterwards. There is a need for detecting such patterns automatically, before they are brought to the attention of analysts by virtue of causing the CAC system to make incorrect code...
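
The rule-based matching described above, in which incoming documents are compared against snippets that analysts have already identified as boilerplate, can be illustrated with a minimal Python sketch. It is not the disclosure's implementation. The function names (normalize, find_known_boilerplate), the lowercase and whitespace normalization, and the sample snippet are all assumptions made for illustration.

```python
import re
from typing import List, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace runs so layout differences do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_known_boilerplate(document: str, snippets: List[str]) -> List[Tuple[int, int, str]]:
    """Return sorted (start, end, snippet) spans for every occurrence of a known snippet.

    Offsets refer to the normalized document, not the raw input (an assumption
    of this sketch; a production system would map them back to raw offsets).
    """
    doc = normalize(document)
    spans = []
    for snippet in snippets:
        needle = normalize(snippet)
        if not needle:
            continue  # skip empty snippets, which would match everywhere
        start = doc.find(needle)
        while start != -1:
            spans.append((start, start + len(needle), snippet))
            start = doc.find(needle, start + 1)
    return sorted(spans)


# Hypothetical analyst-curated snippet list and incoming document.
KNOWN_SNIPPETS = ["If symptoms worsen,  call your physician."]
DOCUMENT = (
    "Patient seen for cough.\n"
    "If symptoms worsen, call\nyour physician. Follow up in 2 weeks."
)

for start, end, snippet in find_known_boilerplate(DOCUMENT, KNOWN_SNIPPETS):
    print(start, end, repr(snippet))
```

Exact matching of this kind finds only boilerplate that is already in the snippet list, which is the limitation the statistical methods are meant to address.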