Domain-based Evaluation Metric for Spoken Dialog Systems
Original Publication Date: 2009-Jun-16
Included in the Prior Art Database: 2009-Jun-16
A new approach is proposed for the evaluation of spoken dialog systems. The novelty of the method lies in combining domain-specific knowledge with a deterministic measurement of dialog system performance on a set of individual tasks within the domain. The proposed methodology thus attempts to answer questions such as: "How well is my dialog system performing on a specific domain?", "How much has my dialog system improved since the previous version?", and "How much better or worse is my dialog system than other dialog systems operating on that domain?"

Keywords: dialog, evaluation, scoring, multimodal, speech recognition
Current methods and techniques for measuring the performance of spoken dialog systems are still very immature. They are either based on subjective evaluation (Wizard of Oz or other usability studies), or they borrow automatic measures used in speech recognition, machine translation or action classification, which yield an incomplete picture of the performance of the system. Nowadays, dialog systems are evaluated by measures used in speech recognition, such as word error rate (WER) or action classification error rate, by techniques that primarily measure dialog coherence, and by frameworks supporting human judgment-based evaluation, such as PARADISE. What is particularly missing in this area is (1) a measurement of performance for a particular domain, (2) a possibility to compare one dialog system with others, and (3) an evaluation of progress during the development of a dialog system. The scoring proposed herein attempts to address these three challenges.
1.2 The Elements of the Proposed Metric
The proposed score consists of two ingredients, both of which range from 0 to 1:
A) domain coverage (DC) score,
B) dialog efficiency (DE) score.
Both scores are described in the following sections. Note that the domain coverage and dialog efficiency results may be combined into a single compound score to obtain a single overall characteristic of the assessed dialog system.
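The combination rule for the compound score is not fixed by the disclosure; the following minimal sketch assumes a geometric mean, one plausible choice that keeps the result in [0, 1] and penalizes a system that is weak on either ingredient:

```python
def compound_score(dc: float, de: float) -> float:
    """Combine domain coverage (DC) and dialog efficiency (DE),
    both in [0, 1], into a single compound score.

    The geometric mean used here is an illustrative assumption,
    not the combination rule mandated by the proposed metric.
    """
    if not (0.0 <= dc <= 1.0 and 0.0 <= de <= 1.0):
        raise ValueError("DC and DE scores must lie in [0, 1]")
    return (dc * de) ** 0.5
```

A weighted arithmetic mean would work equally well if, for a given application, coverage and efficiency should contribute unequally to the overall characteristic.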
The proposed score relies on a good understanding of the dialog domain, which is described in the form of a domain task ontology. The more expert knowledge is projected into the domain ontology, the more reliable the results that can be expected from the proposed score.
2. Capturing Domain Ontology
The cornerstone of the proposed approach is to evaluate spoken and multi-modal dialog systems within a predefined (and typically narrow) domain. Many speech and multimodal applications for various domains, such as music selection, TV remote control, in-car navigation and phone control, are developed using grammars, language models and natural language understanding techniques. In order to compare two spoken dialog systems that deal with the same domain, the domain is first diligently described using a task ontology. This restricted ontology represents the human expert's knowledge of the domain and is encoded as a set of tasks with two kinds of relations between the tasks: task generalization and aggregation. Individual tasks are defined as sequences of parameterized actions. Actions are separable units of domain functionality, such as volume control, song browsing or playback.
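The ontology structure described above can be sketched as simple data types; the class names, fields and the music-player example below are illustrative assumptions, not part of the disclosure:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Action:
    """A separable unit of domain functionality (e.g. volume control)."""
    name: str
    parameters: Dict[str, str] = field(default_factory=dict)


@dataclass
class Task:
    """A task is a sequence of parameterized actions; tasks are linked
    by generalization (is-a) and aggregation (part-of) relations."""
    name: str
    actions: List[Action] = field(default_factory=list)
    generalizes: List["Task"] = field(default_factory=list)  # is-a links
    subtasks: List["Task"] = field(default_factory=list)     # aggregation

# Hypothetical fragment of a music-selection domain ontology:
play = Task("play_song", actions=[Action("song_browsing", {"query": "title"}),
                                  Action("playback")])
adjust = Task("adjust_volume", actions=[Action("volume_control", {"delta": "int"})])
listen = Task("listen_to_music", subtasks=[play, adjust])  # aggregated task
```

Encoding the ontology this explicitly is what makes the metric deterministic: each reference task enumerates exactly the actions a system under test must be able to perform.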
Table 1. Speech-enabled reference tasks for the ju...