Automatic, In-Domain, Question/Answer-Set Generation

IP.com Disclosure Number: IPCOM000245124D
Publication Date: 2016-Feb-10
Document File: 4 page(s) / 41K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a system for automatically generating a set of domain-specific question-answer (QA) pairs from a domain-specific corpus and an existing set of domain-general QA pairs. The output of the system is a high-quality QA set with good coverage suitable for training a QA system to the new domain.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 31% of the total text.

This invention aims to reduce the human element in question creation, drastically decreasing the time, cost, and expertise required.

Statistical QA systems require large quantities of training data in the form of QA pairs. System accuracy is directly correlated with the quantity and quality of the questions provided during the training phase. Currently, creating quality questions is a time-consuming and expensive manual process. Generating QA pairs to adapt an existing QA system to a new topic domain typically requires hundreds of person-hours or more, often from subject matter experts. This expense compounds as clients request handling of multiple new domains in a year.

The disclosed system generates question-answer sets from a set of documents representing a particular domain and a known distribution of question types. The system comprises:
1. Initializing a target distributional specification of question types
2. If an existing QA corpus is provided as input, analyzing that corpus for distributional information
3. Populating the target distributional specification from step 1 with any distributional information from step 2
4. If the user provides distributional information, modifying the target specification from step 3 accordingly
5. Receiving as input a document corpus
6. Initializing a set of generated QA pairs
7. Initializing a distributional specification for the generated set in step 6
8. Selecting a question type for generation by sampling from the distributional specification
9. Selecting a source document for question generation by sampling from the corpus
10. Selecting a source section for question generation by sampling from the document
11. Selecting a source paragraph for question generation by conditional sampling from the section
12. Identifying candidate sentences that support generation of the selected question type
13. Selecting a source sentence for question generation by sampling from the candidate sentences
14. Generating a question and answer of the selected type from the selected sentence
15. Adding the generated question/answer pair from step 14 to the set of QA pairs in step 6
16. Updating the generated distributional specification in step 7 with information about the QA pair in step 15
17. Repeating steps 8-16
18. At periodic checkpoints in the generation process, comparing the target distributional specification in step 4 with the generated distributional specification in step 7 to identify any mismatch
19. If a mismatch is found, generating an intermediate target distributional specification that downweights categories over-represented in the generated specification
20. Modifying step 8 to sample from the intermediate target distribution until the next...
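
The sampling loop in steps 6 through 19 can be illustrated with a minimal Python sketch. It is not the disclosed implementation, and several details are assumptions made only for illustration: the corpus is modeled as nested lists (documents, sections, paragraphs, sentences); sampling at every level is uniform, although step 11 calls for conditional sampling whose conditioning the abbreviated text does not specify; the downweighting rule in intermediate_target is one possible choice; and toy_make_qa is a hypothetical stand-in for a real question generator. Because step 20 is truncated in this abbreviated text, the sketch simply keeps sampling from the intermediate distribution until the next checkpoint recomputes it.

```python
import random
from collections import Counter


def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {k: v / total for k, v in weights.items()}


def intermediate_target(target, generated_counts):
    """Steps 18-19 (assumed rule): downweight over-represented question types."""
    total = sum(generated_counts.values())
    if total == 0:
        return dict(target)
    adjusted = {}
    for qtype, p in target.items():
        observed = generated_counts.get(qtype, 0) / total
        # A type whose observed share exceeds its target share gets a factor < 1.
        factor = min(1.0, p / observed) if observed > 0 else 1.0
        adjusted[qtype] = p * factor
    return normalize(adjusted)


def generate_qa_set(corpus, target, make_qa, n_pairs, checkpoint=50, seed=0):
    """Steps 6-19: sample a question type and a sentence, generate, rebalance."""
    rng = random.Random(seed)
    target = normalize(target)
    sampling_dist = dict(target)
    qa_pairs = []       # step 6: generated QA pairs
    counts = Counter()  # step 7: generated distributional specification
    attempts = 0
    while len(qa_pairs) < n_pairs and attempts < 100 * n_pairs:
        attempts += 1
        qtype = rng.choices(list(sampling_dist),
                            weights=list(sampling_dist.values()))[0]  # step 8
        document = rng.choice(corpus)    # step 9
        section = rng.choice(document)   # step 10
        paragraph = rng.choice(section)  # step 11 (uniform here, an assumption)
        candidates = [s for s in paragraph
                      if make_qa(qtype, s) is not None]  # step 12
        if not candidates:
            continue
        sentence = rng.choice(candidates)            # step 13
        question, answer = make_qa(qtype, sentence)  # step 14
        qa_pairs.append((qtype, question, answer))   # step 15
        counts[qtype] += 1                           # step 16
        if len(qa_pairs) % checkpoint == 0:          # step 18
            sampling_dist = intermediate_target(target, counts)  # step 19
    return qa_pairs, counts


def toy_make_qa(qtype, sentence):
    """Hypothetical generator: returns (question, answer) or None if unsupported."""
    text = sentence.rstrip(".")
    if qtype == "what" and " is " in text:
        subject, _, rest = text.partition(" is ")
        return (f"What is {subject}?", rest)
    if qtype == "yesno":
        return (f"Is it true that {text}?", "yes")
    return None


# Toy corpus: documents -> sections -> paragraphs -> sentences.
corpus = [
    [[["Python is a programming language.", "It runs on many platforms."]]],
    [[["A corpus is a collection of documents."]]],
]
pairs, counts = generate_qa_set(corpus, {"what": 0.5, "yesno": 0.5},
                                toy_make_qa, n_pairs=20, checkpoint=5)
print(counts)
```

Scaling each target probability by the ratio of target share to observed share is only one way to downweight over-represented categories; any rule that lowers their sampling probability until the next checkpoint would fit the description in step 19.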