
Applying Deep Learning to Oil and Gas Due Diligence and Contract Compliance

IP.com Disclosure Number: IPCOM000244351D
Publication Date: 2015-Dec-04
Document File: 5 page(s) / 73K

Publishing Venue

The IP.com Prior Art Database

Related Topics: OTHER

Abstract

In oil and gas and other industries, assets are long-lived and are embodied in tens of thousands of complex legal documents. Packaging these assets for a sale, or analyzing them for purchase due diligence, is daunting and often impossible: time frames for divesting and acquiring assets are measured in weeks, leaving no time to thoroughly assess the complex rights and obligations in these agreements. Further, ensuring ongoing compliance with these agreements is hit or miss, despite the sophistication of modern ERP software. Deep machine learning techniques can efficiently tackle the high complexity and unstructured nature of these agreements. Deep learning is defined here as the combination of data cleansing technology, advanced machine learning technology (supervised and unsupervised) applied on a continuous basis, industry-wide data persistence, and intelligent user interaction that encourages the capture of expert knowledge.



Applying Deep Learning to Oil and Gas Due Diligence and Contract Compliance

December 1, 2015

BACKGROUND AND INTRODUCTION

1. In oil and gas and other industries, assets are long-lived and are embodied in tens of thousands of complex legal documents.

2. Packaging these assets for a sale, or analyzing these assets for purchase due diligence, is daunting and often impossible. Time frames for divesting and acquiring assets are measured in weeks, leaving no time to thoroughly assess the complex rights and obligations in these agreements.

3. Further, ensuring ongoing compliance with these agreements is hit or miss, despite the sophistication of modern ERP software.

4. Deep machine learning techniques can efficiently tackle the high complexity and unstructured nature of these agreements.

5. Deep learning is defined here as the combination of data cleansing technology, advanced machine learning technology (supervised and unsupervised) applied on a continuous basis, industry-wide data persistence, and intelligent user interaction that encourages the capture of expert knowledge.

6. The system and business methods presented here comprise five modules that work together: (1) Data Cleansing Module, (2) Continuous Learning Module, (3) Knowledge Capture Module, (4) Knowledge Sharing Module, and (5) Industry-wide Master Data Module. Each module is described below.
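As a purely illustrative sketch (the class names and composition below are invented for clarity and are not part of the disclosure), the five modules might be wired together in code roughly as follows:

# Hypothetical skeleton of how the five modules might be composed; all class
# and attribute names here are illustrative, not prescribed by the disclosure.
class DataCleansingModule:
    """Corrects OCR output and validates ERP master and transaction data."""

class ContinuousLearningModule:
    """Applies supervised and unsupervised learning on a continuous basis."""

class KnowledgeCaptureModule:
    """Captures expert judgments made during intelligent user interaction."""

class KnowledgeSharingModule:
    """Distributes captured expert knowledge across users and transactions."""

class IndustryWideMasterDataModule:
    """Persists cleansed, linked asset data on an industry-wide basis."""

class DueDiligencePlatform:
    """Wires the five modules together for due diligence and compliance work."""
    def __init__(self):
        self.cleansing = DataCleansingModule()
        self.learning = ContinuousLearningModule()
        self.capture = KnowledgeCaptureModule()
        self.sharing = KnowledgeSharingModule()
        self.master_data = IndustryWideMasterDataModule()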

DATA CLEANSING MODULE

7. The tens of thousands of complex legal documents are commonly stored as scanned images, with selected data elements inside an ERP system. The scanned images are often OCRed, with typical recognition error rates of 20%. The asset data inside ERP systems typically has a 20%-30% rate of errors and omissions.


8. OCR data should be effectively cleansed using a master repository of phrases embodied in bigrams, trigrams, 4-grams and 5-grams, together with their frequencies. These phrases should be generated by harvesting existing OCR results and pruning errors using clustering techniques. The phrases should then be used to correct OCR errors with algorithms such as Viterbi decoding over Hidden Markov Models.
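A minimal Python sketch of this correction step is shown below. It runs a Viterbi search over per-token correction candidates, scoring transitions with harvested bigram frequencies; BIGRAM_FREQ, the candidate generator, and the smoothing constant are illustrative assumptions, not details from the disclosure.

# Minimal sketch of the OCR-correction step: a Viterbi search over per-token
# correction candidates, with transitions scored by harvested bigram
# frequencies. BIGRAM_FREQ, the candidate generator, and the smoothing value
# are illustrative assumptions.
from difflib import get_close_matches
from math import log

# Hypothetical harvested phrase repository: (word, word) bigram -> frequency
BIGRAM_FREQ = {("joint", "operating"): 120, ("operating", "agreement"): 115,
               ("royalty", "interest"): 80, ("working", "interest"): 95}
VOCAB = sorted({w for bigram in BIGRAM_FREQ for w in bigram})

def candidates(token, n=3):
    """Propose plausible corrections for a possibly mis-recognized token."""
    return get_close_matches(token.lower(), VOCAB, n=n, cutoff=0.6) or [token.lower()]

def transition_logprob(prev, cur, smoothing=0.5):
    """Smoothed log-probability of seeing `cur` after `prev`."""
    total = sum(BIGRAM_FREQ.values())
    return log((BIGRAM_FREQ.get((prev, cur), 0) + smoothing) / (total + smoothing))

def viterbi_correct(tokens):
    """Pick the highest-scoring candidate sequence for a run of OCR tokens."""
    lattice = [candidates(t) for t in tokens]
    # best[i][cand] = (cumulative score, best predecessor candidate)
    best = [{c: (0.0, None) for c in lattice[0]}]
    for i in range(1, len(lattice)):
        column = {}
        for cur in lattice[i]:
            score, prev = max(
                ((best[i - 1][p][0] + transition_logprob(p, cur), p)
                 for p in lattice[i - 1]),
                key=lambda pair: pair[0])
            column[cur] = (score, prev)
        best.append(column)
    # Backtrack from the best-scoring final candidate.
    cur = max(best[-1], key=lambda c: best[-1][c][0])
    corrected = [cur]
    for i in range(len(best) - 1, 0, -1):
        cur = best[i][cur][1]
        corrected.append(cur)
    return list(reversed(corrected))

print(viterbi_correct(["joint", "operatlng", "agreernent"]))
# -> ['joint', 'operating', 'agreement']

In a full system, candidate similarity could also be folded into the score as an emission probability, and the transition model would draw on the complete bigram-through-5-gram repository rather than the toy table shown here.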

9. The corrected OCR results should be stored page by page and document by document in a database to support retrieval and machine learning. A TF-IDF vector should be created for every page and document, and stored in this database.
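A minimal sketch of the page-level TF-IDF step follows, assuming scikit-learn for vectorization and SQLite as a stand-in for the retrieval database; the library choices and table layout are illustrative only.

# Minimal sketch of building and persisting per-page TF-IDF vectors from the
# corrected OCR text. scikit-learn, SQLite, and the table layout are
# illustrative choices, not prescribed by the disclosure.
import json
import sqlite3
from sklearn.feature_extraction.text import TfidfVectorizer

pages = [  # (document id, page number, corrected OCR text)
    ("DOC-001", 1, "joint operating agreement between the parties"),
    ("DOC-001", 2, "royalty interest and working interest obligations"),
    ("DOC-002", 1, "assignment of oil and gas lease"),
]

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
matrix = vectorizer.fit_transform([text for _, _, text in pages])

conn = sqlite3.connect("ocr_corpus.db")
conn.execute("""CREATE TABLE IF NOT EXISTS page_vectors (
                  doc_id TEXT, page_no INTEGER, tfidf_json TEXT,
                  PRIMARY KEY (doc_id, page_no))""")

for (doc_id, page_no, _), row in zip(pages, matrix):
    # Store each sparse vector as {term_index: weight} to support retrieval
    # and downstream machine learning over pages and documents.
    vec = {str(int(i)): float(v) for i, v in zip(row.indices, row.data)}
    conn.execute("INSERT OR REPLACE INTO page_vectors VALUES (?, ?, ?)",
                 (doc_id, page_no, json.dumps(vec)))
conn.commit()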

10. ERP master data should be extracted in a denormalized and serialized format, such as JSON, and stored in a document-oriented database. Each of these records should be linked to their related document images and corresponding OCR text. A random sample of these records should be extracted, and each data element should be examined against the corresponding document images and marked as valid, invalid or suspect. The data elements with the highest error rates should be registered in the Continuous Learning Module.
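The following sketch illustrates, under assumed field names and with an in-memory stand-in for the document-oriented database, how an ERP master record might be denormalized to JSON, linked to its images and OCR text, and sampled for manual validation:

# Illustrative sketch of denormalizing ERP master records into JSON documents
# linked to document images and OCR text, then drawing a random review sample.
# Field names and the in-memory "document_store" are assumptions, not details
# taken from the disclosure.
import json
import random

def denormalize(erp_record, image_paths, ocr_doc_ids):
    """Flatten one ERP master record and link it to its source documents."""
    return {
        "asset_id": erp_record["asset_id"],
        "lease_name": erp_record.get("lease_name"),
        "working_interest": erp_record.get("working_interest"),
        "linked_images": image_paths,      # scanned contract page images
        "linked_ocr_docs": ocr_doc_ids,    # corrected OCR text documents
        "validation": {"status": "unreviewed", "invalid_fields": []},
    }

document_store = []  # stand-in for a document-oriented database collection
erp_rows = [
    {"asset_id": "A-1001", "lease_name": "Smith 1H", "working_interest": 0.75},
    {"asset_id": "A-1002", "lease_name": "Jones 2H", "working_interest": 0.50},
]
for row in erp_rows:
    document_store.append(
        denormalize(row, [f"images/{row['asset_id']}.tif"],
                    [f"ocr/{row['asset_id']}.txt"]))

# Draw a random sample (e.g., roughly 10%) for manual review; reviewers mark
# each data element valid, invalid, or suspect against the source documents.
sample = random.sample(document_store, k=max(1, len(document_store) // 10))
for record in sample:
    record["validation"]["status"] = "reviewed"
print(json.dumps(sample, indent=2))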

11. ERP transaction data should also be extracted in a denormalized and serialized format, such as JSON,...