A spoken dialog system is one of the ultimate goals of speech research. It should be realized as an integration of advanced spoken language technologies, including continuous speech recognition, language processing, dialog management, sentence generation, and speech synthesis. Although this goal is rather ambitious, we expect it to be realized in the near future. As speech is the most natural and easy communication channel for humans, spoken language can make interfaces between humans and machines very friendly. Spoken language processing is also a key technology for multimedia systems.
The major results in speech research obtained thus far in our lab include a speech recognition expert named SPREX, which recognizes continuous speech by simulating the human way of reading feature parameter trajectories using knowledge base technology; a language processing subsystem for speech understanding named ASP, which stands for ASsociation-based Parser; a dialog management system named MASCOTS-I, based on SR-plans (Stimulus-Response plans); and a system for speech synthesis from concept representation. All these modules are being combined into a general speech interface system that is independent of the dialog domain. We are aiming at speech communication with various intelligent performance systems (IPS). The conventional architecture of a spoken dialog system is often dependent on the domain, which makes it difficult to reuse its basic dialog management component in other systems with different domains. In order to realize a portable speech interface, the dialog manager in our interface system is designed to use as little domain-dependent knowledge as possible.
Reports: 1992 - 1994
In general, there are at least two kinds of structure in a dialog: one is the utterance pair and the other is the discourse segment. We model these structures by a plan architecture and by a network model of topic transitions.
We defined two types of SR-plans for a dialog between an Intelligent Performance System (IPS) on a computer and a human user. One is the system SR-plan, which deals with an interaction initiated by an IPS stimulus and followed by the user's response. The other is the user SR-plan, for an interaction initiated by a user's stimulus and followed by the IPS's response. We classified the SR-plans into 17 categories according to the types of interactions. An SR-plan basically consists of a stimulus followed by a response. A system SR-plan can optionally contain an IPS evaluation, and a user SR-plan can optionally contain a user's confirmation.
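As a minimal sketch of the SR-plan structure described above, the following Python models a plan as a stimulus-response pair with an optional third step. The class name `SRPlan`, its field names, and the example utterances are illustrative assumptions, not the representation actually used in MASCOTS.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SRPlan:
    """Hypothetical SR-plan: a stimulus, a response, and an optional third step."""
    initiator: str                        # "system" (IPS speaks first) or "user"
    stimulus: str
    response: str
    optional_step: Optional[str] = None   # IPS evaluation or user confirmation

    def turns(self) -> List[Tuple[str, str]]:
        """Return the (speaker, utterance) sequence implied by the plan."""
        other = "user" if self.initiator == "system" else "system"
        seq = [(self.initiator, self.stimulus), (other, self.response)]
        if self.optional_step is not None:
            # In a system SR-plan the IPS evaluates the response;
            # in a user SR-plan the user confirms it.
            seq.append((self.initiator, self.optional_step))
        return seq


# Invented example of a system SR-plan with an IPS evaluation.
plan = SRPlan("system", "Which day?", "Monday.", "Understood.")
for speaker, utterance in plan.turns():
    print(speaker, utterance)
```

In both plan types the optional third step is spoken by the party that issued the stimulus, which is why a single field is enough in this sketch.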
There are two types of topic transitions in a dialog: topic continuation and topic change. A topic does not usually change when an utterance answers a question; this is labeled topic continuation. A topic change means that the topic of an utterance is different from that of the preceding utterance. Topic change can be further classified into three types: topic shift, topic elaboration, and topic termination. In the TPN model, these topic changes are easily represented as a move to another topic in the same TP, a move to a descendant TP, and a return to an ancestor TP, respectively.
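The mapping from topic transitions to moves in the TPN can be sketched as follows. This is an illustration only: the TP hierarchy, the topic names, and the function names are invented for the example, and the extra label "other" covers moves that the three types above do not describe.

```python
from typing import Dict, Optional

# Hypothetical TPN: each TP names its parent TP (None for the root).
TP_PARENT: Dict[str, Optional[str]] = {
    "travel": None,
    "hotel": "travel",
    "room": "hotel",
}

# Hypothetical assignment of topics to TPs.
TOPIC_TP: Dict[str, str] = {
    "destination": "travel",
    "dates": "travel",
    "price": "hotel",
    "bed_type": "room",
}


def is_ancestor(ancestor: str, tp: str) -> bool:
    """True if `ancestor` lies strictly above `tp` in the TP hierarchy."""
    parent = TP_PARENT[tp]
    while parent is not None:
        if parent == ancestor:
            return True
        parent = TP_PARENT[parent]
    return False


def classify_transition(prev_topic: str, next_topic: str) -> str:
    """Label the transition between the topics of two consecutive utterances."""
    if prev_topic == next_topic:
        return "continuation"
    prev_tp, next_tp = TOPIC_TP[prev_topic], TOPIC_TP[next_topic]
    if prev_tp == next_tp:
        return "shift"          # another topic in the same TP
    if is_ancestor(prev_tp, next_tp):
        return "elaboration"    # move down to a descendant TP
    if is_ancestor(next_tp, prev_tp):
        return "termination"    # return to an ancestor TP
    return "other"              # not covered by the three types in the text
```

For instance, moving from `dates` to `price` descends from the `travel` TP to the `hotel` TP and is labeled an elaboration, while moving from `bed_type` back to `destination` is labeled a termination.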
2. MASCOTS-II
The previous version of MASCOTS, referred to as MASCOTS-I, was
developed only for speech understanding, particularly for the language
processing subsystem ASP. We extended MASCOTS-I to cover dialog
processing in the speech interface system, together with the dialog
model, while maintaining its generality. The new version of MASCOTS is
called MASCOTS-II. MASCOTS-II plays important roles in both the
understanding and the generation of spoken language.
We proposed a method of utterance prediction based on two levels of dialog models, the SR-plan and the TPN. Each system SR-plan has templates for analyzing the user's response, and each user SR-plan has templates for analyzing the user's stimulus. MASCOTS-II predicts the user's next utterance using the templates instantiated by the preceding IPS utterance and by plausible topics. It is not difficult to predict the user's response to an IPS stimulus, because the meaning of the user's utterance is constrained by that of the IPS utterance. The user's stimuli, by contrast, are difficult to predict without knowledge about the dialog domain. The TPN model is introduced to limit the range of meaning of the user's next utterance. The templates for the user's stimuli are described in the user SR-plans in terms of topic components, and the topic predicted by the TPN instantiates these components. The evaluation experiment showed that the dialog knowledge in the SR-plans and the TPN could prune the candidate word space: introducing it removed more than 40% of the words from the candidates in the speech recognition result. When the topic is correctly identified, about 60% of the words can be removed.
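The idea of pruning recognition candidates with instantiated templates can be illustrated with a small sketch. Everything here is an assumption made for the example: the template strings, the `{topic}` placeholder standing in for a topic component, and the word-level filtering rule. The text does not specify how MASCOTS-II represents templates or applies them to the recognizer output.

```python
from typing import List, Set

# Hypothetical user-stimulus templates; "{topic}" marks a topic component.
TEMPLATES: List[str] = ["how much is the {topic}", "tell me the {topic}"]


def predict_vocabulary(templates: List[str], topics: List[str]) -> Set[str]:
    """Instantiate each template with each plausible topic and collect the words."""
    vocab: Set[str] = set()
    for template in templates:
        for topic in topics:
            vocab.update(template.format(topic=topic).split())
    return vocab


def prune(candidates: List[str], vocab: Set[str]) -> List[str]:
    """Keep only the candidate words that some predicted utterance contains."""
    return [word for word in candidates if word in vocab]


# Topics assumed to have been predicted from the topic network.
vocab = predict_vocabulary(TEMPLATES, ["price", "address"])
kept = prune(["price", "rice", "the", "tea", "address"], vocab)
```

With these invented inputs, the acoustically similar but unpredicted candidates `rice` and `tea` are dropped, which mirrors the kind of reduction in candidate words reported above.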
It is frequently observed in a dialog that some words in a sentence are emphasized, through prosody, verbal expression, or both. If those words are not emphasized when the sentence is uttered in isolation, that is, without dialog context, their emphasis is considered to be caused by the dialog context itself. These dialog-dependent emphases should be dealt with not in the IPS but in the interface system. MASCOTS-II extracts the words to be emphasized by making use of the dialog history and the dialog model. The concept-to-speech architecture makes such procedures very easy.
In general, errors in recognizing the user's input speech are inevitable. Accurate communication requires a mechanism that informs the user of the recognition results of his/her speech, either implicitly or explicitly. Otherwise, the user may receive wrong information from the IPS and remain unaware of the error. To avoid such a situation, the recognition result must be reported to the user. MASCOTS-II has a mechanism for generating an explicit or an implicit notification utterance in order to verify the received information.
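The difference between explicit and implicit notification can be sketched as below. The choice between the two by a recognition confidence threshold is purely an assumption for illustration, as are the function name `notify` and the wording of the utterances; the text does not state how MASCOTS-II selects the notification type.

```python
def notify(recognized: str, answer: str, confidence: float,
           threshold: float = 0.5) -> str:
    """Return a system utterance that exposes the recognition result to the user.

    `confidence` and `threshold` are hypothetical: they stand in for whatever
    criterion the dialog manager uses to pick the notification type.
    """
    if confidence < threshold:
        # Explicit notification: ask the user to verify before answering.
        return f"Did you say '{recognized}'?"
    # Implicit notification: echo the recognized content inside the answer,
    # so that a misrecognition becomes visible to the user.
    return f"For {recognized}, {answer}"
```

In the implicit case the user hears what the system understood as part of the answer itself and can correct it without a separate confirmation turn.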
3. Spoken language generation based on concept-to-speech conversion
The problem of "from what speech is converted" is a crucial issue in
speech synthesis. Earlier studies on speech synthesis paid special
attention to improving the quality of the synthetic speech. Natural
language text was usually the source of synthetic speech, and this
speech synthesis technique is called Text-To-Speech (TTS). In spoken
dialog systems, however, the IPS first represents what it wants to say
at the concept level, not in text. The speech synthesis component
should therefore convert concepts directly into speech.
Concept-to-speech (CTS) conversion is a key technique in the general
speech interface system.
The CTS system has several advantages over conventional TTS conversion systems: (1) there is no need for text analysis, because CTS generates sentences from the concept representation itself; (2) some prosodic features, such as emphasis and utterance speed, can be embedded in the concept representation; and (3) a concept representation can be transformed into various sentences according to the dialog context. The IPS should determine only what to say, not how to say it.
The SOCS system directly controls prosodic parameters using the concept representation with two built-in mechanisms: the Prosody Modification Function (PMF) in the custom template and the pause marker. In SOCS, the custom templates play an important role and generate the prosodic parameters for the prepared patterns. Since many IPSs have output message patterns that are frequently used to answer the user's questions and to ask questions, preparing such patterns is a very effective way for the CTS system to generate very natural synthetic speech.
In CTS conversion, sentence generation is as important an issue as speech wave production. We analyzed how the surface sentence depends on the dialog context through a sentence generation experiment. The differences among the generated sentences were classified into 22 categories. Some categories, such as changes of function words and idiomatic expressions, are of course specific to Japanese. We believe that the other categories are independent of the language. A second experiment, on preferences among surface expressions, made it clear that most of the differences observed in the first experiment are substantial for utterance generation in dialog.
We also investigated the context dependencies of prosody. They are modeled by linear regression on some features of the dialog context. This stochastic modeling needs two kinds of speech data: isolated utterances without dialog context and contextual utterances in a dialog. The conventional TTS rules were extracted from the isolated utterances and then applied to the dialog utterances. While this first set of rules captures the prosodic characteristics of isolated utterances very well, it knows nothing about how to predict prosody appropriate to the dialog context. Most errors in predicting the prosody of the dialog utterances are caused by the contextual effect of the dialog. The second set of rules was extracted from these prediction errors by linear regression. It adjusted prosody to the dialog context, and the total prediction error decreased from 38.9 Hz to 27.2 Hz for the data on which the first set of rules had large prediction errors.
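The two-stage procedure above, a baseline prediction followed by a regression fitted to its errors, can be sketched in a few lines. This is a toy illustration under stated assumptions: the data are invented and are not the experimental data, a single numeric context feature is used where the study used several, and mean absolute error is assumed as the error measure.

```python
from typing import List, Tuple


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y ~ a * x + b with a single feature."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def mean_abs_error(pred: List[float], obs: List[float]) -> float:
    """Average absolute difference between predictions and observations."""
    return sum(abs(p - o) for p, o in zip(pred, obs)) / len(obs)


# Invented toy data (not from the experiment); values are F0 in Hz.
context = [0.0, 1.0, 2.0, 3.0]             # hypothetical dialog-context feature
baseline = [120.0, 120.0, 120.0, 120.0]    # first-stage (isolated-utterance) prediction
observed = [121.0, 129.0, 141.0, 149.0]    # F0 measured in dialog utterances

# Second stage: regress the first-stage errors on the context feature.
residuals = [o - b for o, b in zip(observed, baseline)]
a, b = fit_line(context, residuals)
corrected = [base + a * c + b for base, c in zip(baseline, context)]

print(mean_abs_error(baseline, observed), mean_abs_error(corrected, observed))
```

Fitting the second stage to the residuals rather than to the raw values leaves the first-stage rules untouched, so the context correction can be added or removed independently.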