Spoken language processing

Institute of Scientific and Industrial Research, Osaka University

Speech technologies have improved greatly in recent years thanks to data-oriented methodology. Increased computational power has made it easy to use large amounts of data. Stochastic methods such as the HMM (Hidden Markov Model) provide more accurate speech recognition. The concatenation method using a wave dictionary synthesizes very clear artificial speech. This progress has extended the research area to challenging applications of speech technologies.

A spoken dialog system is one of the ultimate goals of speech research. It should be realized as an integration of advanced spoken language technologies, including continuous speech recognition, language processing, dialog management, sentence generation, and speech synthesis. Although this goal is rather ambitious, we expect it to be realized in the near future. Because speech is the most natural and easiest communication channel for humans, interfaces between humans and machines can be made very friendly through spoken language. Spoken language processing is also a key technology for multimedia systems.

The major results in speech research obtained thus far in our lab include: a speech recognition expert named SPREX, which recognizes continuous speech by simulating the human way of reading feature parameter trajectories using knowledge base technology; a language processing subsystem for speech understanding named ASP (ASsociation-based Parser); a dialog management system named MASCOTS-I, based on SR (Stimulus-Response) plans; and a system that synthesizes speech from concept representation. All these modules are being combined into a general speech interface system that is independent of the dialog domain. We are aiming at speech communication with various intelligent performance systems (IPS). The conventional architecture of a spoken dialog system is often dependent on the domain, which makes it difficult to reuse its basic dialog management component in systems for other domains. To realize a portable speech interface, the dialog manager in our interface system is designed to use as little domain-dependent knowledge as possible.

Reports: 1992 - 1994

1. Dialog model

The dialog manager in our speech interface system has many useful functions. Predicting the next utterance provides semantic constraints that help spoken language recognition, and extracting dialog information contributes to the synthesis of natural spoken language. These functions cannot be realized without the dialog model, which is the most important concept for constructing spoken dialog systems.

In general, there are at least two kinds of structure in a dialog: one is the utterance pair and the other is the discourse segment. We model these structures with a plan architecture and a network model of topic transitions, respectively.

1.1 SR-plan

In a cooperative and goal-oriented dialog, a person usually replies when he/she is asked. We model this characteristic of a dialog as a sort of plan, called SR-plan. An SR-plan corresponds to an interaction composed of a stimulus (requirement) and a response to it.

We defined two types of SR-plans for a dialog between an Intelligent Performance System (IPS) running on a computer and a human user. One is the system SR-plan, which deals with an interaction initiated by the IPS's stimulus and followed by the user's response. The other is the user SR-plan, for an interaction initiated by the user's stimulus and followed by the IPS's response. We classified the SR-plans into 17 categories according to the types of interactions. An SR-plan basically consists of a stimulus followed by a response. In addition, a system SR-plan can optionally contain an IPS evaluation, and a user SR-plan can optionally contain a user confirmation.
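The structure of an SR-plan described above can be sketched as a small data type. This is only an illustrative sketch, not the MASCOTS implementation: the class and field names are hypothetical, and the `category` value is a made-up label because the 17 categories are not listed here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Initiator(Enum):
    SYSTEM = "system"  # system SR-plan: IPS stimulus, then user response
    USER = "user"      # user SR-plan: user stimulus, then IPS response


@dataclass
class SRPlan:
    initiator: Initiator
    category: str                     # hypothetical label; the real 17 categories are not given
    stimulus: str
    response: Optional[str] = None
    follow_up: Optional[str] = None   # optional IPS evaluation or user confirmation

    def follow_up_role(self) -> str:
        # A system SR-plan may end with an IPS evaluation,
        # a user SR-plan with a user confirmation.
        if self.initiator is Initiator.SYSTEM:
            return "ips_evaluation"
        return "user_confirmation"

    def turns(self) -> List[Tuple[str, str, str]]:
        """Return the (speaker, role, text) turns realized so far, in order."""
        if self.initiator is Initiator.SYSTEM:
            first, second = "ips", "user"
        else:
            first, second = "user", "ips"
        out = [(first, "stimulus", self.stimulus)]
        if self.response is not None:
            out.append((second, "response", self.response))
        if self.follow_up is not None:
            # The follow-up is spoken by whoever issued the stimulus.
            out.append((first, self.follow_up_role(), self.follow_up))
        return out


plan = SRPlan(Initiator.SYSTEM, "ask-value", "Which day?", "Friday", "Friday is available.")
for turn in plan.turns():
    print(turn)
```

Keeping the optional third turn in the same record as the stimulus and response reflects the description of the plan as one interaction unit rather than a list of independent utterances.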

1.2 TPN

In general, we often begin a dialog with a rough topic and rarely change it to an unrelated one. The topic changes into a more elaborated one and often returns to the previous topic, explicitly or implicitly. A topic can be elaborated only into a limited set of topics, called descendant topics. We believe that the relation between a topic and its descendant topics is independent of individual dialogs. Such a set of descendant topics is defined as a TP (topic packet). To trace the topics in a dialog, we propose a network composed of TPs linked to each other. This network is named TPN (topic packet network).

There are two types of topic transitions in a dialog: topic continuation and topic change. A topic does not usually change in an answer utterance to a question. This is labeled topic continuation. A topic change means that the topic of an utterance is different from that of the preceding utterance. Furthermore, topic change can be classified into three types: topic shift, topic elaboration, and topic termination. In the TPN model, these topic changes are easily represented as a move to another topic in the same TP, as a move to the descendant TP, and as a return to the ancestor TP, respectively.
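The mapping from topic transitions to moves in the TPN can be illustrated with a toy network. The sketch below is a hypothetical example: the topics, packet names, and the `classify_transition` function are invented for illustration, and only a direct descendant TP is treated as an elaboration.

```python
from typing import Optional

# Hypothetical toy network: each topic packet (TP) lists its member topics,
# and a topic may own one descendant TP that elaborates it.
TP_MEMBERS = {
    "tp_root": {"travel", "lodging"},
    "tp_travel": {"date", "destination"},
}
DESCENDANT_TP = {"travel": "tp_travel"}  # topic -> TP holding its descendant topics


def packet_of(topic: str) -> str:
    """Return the TP that contains the given topic."""
    for tp, members in TP_MEMBERS.items():
        if topic in members:
            return tp
    raise KeyError(topic)


def parent_topic_of(tp: str) -> Optional[str]:
    """Return the topic whose descendant TP is `tp`, or None for a root TP."""
    for topic, child in DESCENDANT_TP.items():
        if child == tp:
            return topic
    return None


def classify_transition(prev: str, cur: str) -> str:
    if prev == cur:
        return "continuation"
    prev_tp, cur_tp = packet_of(prev), packet_of(cur)
    if cur_tp == prev_tp:
        return "shift"            # move to another topic in the same TP
    if DESCENDANT_TP.get(prev) == cur_tp:
        return "elaboration"      # move to the descendant TP
    tp = prev_tp
    while True:                   # walk upward looking for an ancestor TP
        parent = parent_topic_of(tp)
        if parent is None:
            break
        tp = packet_of(parent)
        if tp == cur_tp:
            return "termination"  # return to an ancestor TP
    return "unrelated"


print(classify_transition("travel", "date"))
print(classify_transition("date", "lodging"))
```

The extra "unrelated" label is an addition of this sketch for transitions that fit none of the three types; the text above notes that such changes are rare.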

2. MASCOTS-II

The previous version of MASCOTS, referred to as MASCOTS-I, was developed only for speech understanding, particularly for the language processing subsystem ASP. We extended MASCOTS-I, together with the dialog model, to cover dialog processing in the speech interface system while maintaining its generality. The new version is called MASCOTS-II. MASCOTS-II plays important roles in both the understanding and the generation of spoken language.

2.1 Dialog processing for spoken language understanding

Humans understand spoken language by listening to speech, but this cannot be accomplished without high-level knowledge such as linguistic knowledge. High-level knowledge is also important in the automatic understanding of spoken language by computer, because speech recognition is imperfect. Knowledge about dialog is particularly useful for understanding a cooperative and goal-oriented spoken dialog. One efficient way to use dialog knowledge is to predict the next utterance. Utterance prediction provides the language processing subsystem with useful constraints on the meaning and vocabulary of the utterance. This technique becomes more important for tasks with medium or large vocabularies.

We proposed a method of utterance prediction based on two levels of dialog model, the SR-plan and the TPN. Each system SR-plan has templates for analyzing the user's response, and each user SR-plan has templates for analyzing the user's stimulus. MASCOTS-II predicts the user's next utterance using the templates instantiated by the preceding IPS utterance and by plausible topics. It is not difficult to predict the user's response to an IPS stimulus because the meaning of the user's utterance is constrained by that of the IPS's utterance. The user's stimuli, however, are difficult to predict without knowledge about the dialog domain. The TPN model is introduced to limit the range of meaning of the user's next utterance. The templates for the user's stimuli are described in the user SR-plans in terms of topic components, and the topic predicted by the TPN instantiates these components. An evaluation experiment showed that the dialog knowledge in the SR-plans and the TPN could prune the candidate word space: introducing them removed more than 40% of the candidate words from the speech recognition results, and about 60% when the topic was correctly identified.
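The pruning effect of utterance prediction can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the template format, the vocabulary tables, and the function names are hypothetical, and a real system would constrain meaning as well as vocabulary.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical domain data: words associated with each topic.
TOPIC_VOCAB: Dict[str, Set[str]] = {
    "date": {"monday", "friday", "tomorrow"},
    "destination": {"osaka", "kyoto", "tokyo"},
}

# Hypothetical user-stimulus templates: fixed words plus a topic component
# that is filled in by the topics predicted from the topic network.
USER_STIMULUS_TEMPLATES = [
    {"words": {"what", "is", "the"}, "topic_slot": True},
    {"words": {"change", "the", "to"}, "topic_slot": True},
]


def predicted_vocabulary(plausible_topics: Iterable[str]) -> Set[str]:
    """Instantiate every template with the plausible topics and collect the words."""
    topics = list(plausible_topics)
    allowed: Set[str] = set()
    for template in USER_STIMULUS_TEMPLATES:
        allowed |= template["words"]
        if template["topic_slot"]:
            for topic in topics:
                allowed.add(topic)
                allowed |= TOPIC_VOCAB.get(topic, set())
    return allowed


def prune_candidates(candidates: List[str], allowed: Set[str]) -> List[str]:
    """Keep only the recognizer's candidate words that the prediction allows."""
    return [word for word in candidates if word in allowed]


candidates = ["what", "is", "the", "osaka", "friday", "banana"]
allowed = predicted_vocabulary(["date"])
print(prune_candidates(candidates, allowed))
```

In this toy case, predicting the topic "date" removes the destination word and the out-of-domain word from the candidate list, which is the kind of reduction the experiment above measures.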

2.2 Dialog processing for spoken language generation

The most important point in spoken language generation is to convey correctly to the user what the IPS has understood and what it wants to say. From the viewpoint of spoken dialog system design, separating the roles of the IPS and the interface component is very important for facilitating system development. In speech generation, the IPS is responsible for problem solving and for determining 'what to say'. The interface system should convert this into spoken language appropriate to the dialog context in terms of both prosody and surface sentence. MASCOTS-II carries out several functions that improve the quality of the output spoken language, based on a concept-to-speech conversion architecture.

It is frequently observed in a dialog that some words in a sentence are emphasized through prosody, verbal expression, or both. If those words are not emphasized when the sentence is uttered in isolation, that is, without dialog context, their emphasis is considered to be caused intrinsically by the dialog context. Such dialog-dependent emphasis should be handled not in the IPS but in the interface system. MASCOTS-II extracts the words to be emphasized by using the dialog history and the dialog model. The concept-to-speech architecture makes such procedures very easy.

In general, errors in recognizing the user's input speech are inevitable. Accurate communication requires a mechanism that informs the user, implicitly or explicitly, of how his/her speech was recognized. Otherwise, the user may receive wrong information from the IPS without being aware of the error. To avoid such a situation, MASCOTS-II has a mechanism for generating an explicit or an implicit notification utterance in order to verify the received information.
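The difference between explicit and implicit notification can be shown with a short sketch. How MASCOTS-II chooses between the two is not described here, so the recognition-confidence threshold below is purely an assumption of this example, as are the function name, parameters, and sentence patterns.

```python
def notification_utterance(slot: str, value: str, next_question: str,
                           confidence: float, threshold: float = 0.6) -> str:
    """Build an utterance that tells the user what was recognized.

    Assumption of this sketch: a low recognition confidence triggers an
    explicit confirmation question; otherwise the recognized value is
    echoed implicitly inside the next system utterance.
    """
    if confidence < threshold:
        # Explicit notification: ask the user to verify the value directly.
        return f"Did you say {value} for the {slot}?"
    # Implicit notification: embed the value so the user can notice an error.
    return f"With {value} as the {slot}, {next_question}"


print(notification_utterance("destination", "Kyoto", "when would you like to leave?", 0.3))
print(notification_utterance("destination", "Kyoto", "when would you like to leave?", 0.9))
```

Either form gives the user a chance to notice a misrecognition; the implicit form does so without spending a dialog turn on confirmation.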

3. Spoken language generation based on concept-to-speech conversion

The question of "from what speech is converted" is a crucial issue in speech synthesis. Earlier studies on speech synthesis paid special attention to improving the quality of the synthetic speech. Natural language text was usually the source of synthetic speech, and this technique is called Text-To-Speech (TTS). In spoken dialog systems, however, the IPS first represents what it wants to say at the concept level, not in text, so the speech synthesis component should convert concepts directly into speech. Concept-to-speech conversion (CTS) is therefore a key technique in the general speech interface system.

3.1 SOCS

For human-computer interaction through spoken language, natural language text is not the most convenient form for describing messages from the IPS to the speech synthesis component. The message representation should be designed with the interaction between these two components in mind. From this point of view, we designed a CTS-based speech synthesis architecture called SOCS (speech output from case structure representation). SOCS plays the central role in speech output in our general speech interface system: it converts the concept representation generated by the IPS into speech. The concept representation scheme is defined based on the case structure and some phrase patterns.

The CTS system has several advantages over conventional TTS conversion systems: (1) there is no need for text analysis because CTS generates sentences from the concept representation itself, (2) prosodic features such as emphasis and utterance speed can be embedded in the concept representation, and (3) a concept representation can be transformed into various sentences according to the dialog context. The IPS needs to determine only what to say, without considering how to say it.

The SOCS system directly controls prosodic parameters using the concept representation with two built-in mechanisms: the Prosody Modification Function (PMF) in the custom template and the pause marker. In SOCS, the custom templates play an important role and generate the prosodic parameters for the prepared patterns. Many IPSs have output message patterns that are used frequently to answer the user's questions and to ask the user questions, so preparing such patterns is a very effective way for the CTS system to generate very natural synthetic speech.
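The idea of carrying prosodic information in the concept representation, and realizing it through a prepared template, can be sketched as follows. The actual SOCS representation format is not given here, so the `Concept` class, the template syntax, the `<pause>` marker, and the per-word emphasis flag (standing in for a prosody modification) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Concept:
    predicate: str
    cases: Dict[str, str]                            # case role -> filler word
    emphasis: Set[str] = field(default_factory=set)  # case roles to emphasize
    speed: float = 1.0                               # relative utterance speed


# Hypothetical custom template: literal words, {role} slots, and pause markers.
TEMPLATES: Dict[str, List[str]] = {
    "depart": ["the", "{object}", "leaves", "<pause>", "at", "{time}"],
}


def realize(concept: Concept) -> List[Tuple[str, dict]]:
    """Turn a concept into (word, prosody) pairs using its prepared template."""
    out: List[Tuple[str, dict]] = []
    for item in TEMPLATES[concept.predicate]:
        if item == "<pause>":
            out.append(("", {"pause": True}))
        elif item.startswith("{") and item.endswith("}"):
            role = item[1:-1]
            prosody = {"emphasis": role in concept.emphasis, "speed": concept.speed}
            out.append((concept.cases[role], prosody))
        else:
            out.append((item, {"emphasis": False, "speed": concept.speed}))
    return out


concept = Concept("depart", {"object": "train", "time": "nine"}, emphasis={"time"})
for word, prosody in realize(concept):
    print(repr(word), prosody)
```

No text analysis step appears anywhere in the sketch: the words and their prosodic annotations are produced together from the concept, which is the first advantage of CTS listed above.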

3.2 Context-dependent generation of sentence and prosody in dialog

The surface sentence and prosody of an utterance in a dialog vary with the dialog context even if its meaning is the same. Generating high-quality spoken utterances therefore requires understanding the diversity of surface sentences and prosodic parameters and modeling them based on the dialog context.

In CTS conversion, sentence generation is as important an issue as speech wave production. We analyzed how the surface sentence depends on the dialog context through a sentence generation experiment. The differences among the generated sentences were classified into 22 categories. Some categories, such as changes of function words and idiomatic expressions, are of course specific to Japanese. We believe that the other categories are independent of the language. A second experiment, on preferences among surface sentence expressions, made it clear that most of the differences observed in the first experiment are substantial for utterance generation in dialog.

We also investigated the context dependencies of prosody. They are modeled by linear regression on some features of the dialog context. This stochastic modeling needs two kinds of speech data: isolated utterances without dialog context and contextual utterances in a dialog. Conventional TTS rules were extracted from the isolated utterances and then applied to the dialog utterances. While this first set of rules captures the prosodic characteristics of isolated utterances very well, it knows nothing about how to predict prosody appropriate to the dialog context. Most errors in predicting the prosody of the dialog utterances are caused by the contextual effect of the dialog. The second set of rules was extracted from these prediction errors by linear regression. It adjusted prosody to the dialog context, and the total prediction error decreased from 38.9 Hz to 27.2 Hz for the data on which the first set of rules had large prediction errors.
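The two-stage scheme, in which a second model is fitted to the prediction errors of the first, can be illustrated with a minimal sketch. The numbers below are invented toy values, not the experimental data, and a single hypothetical context feature is used where the actual model uses several.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(predicted: List[float], observed: List[float]) -> float:
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


# Invented toy data (Hz). base_f0 stands for the first-stage predictions made
# by rules learned from isolated utterances; observed is the dialog speech.
base_f0 = [120.0, 130.0, 125.0, 140.0]
observed = [118.0, 150.0, 127.0, 158.0]
# Hypothetical dialog-context feature, e.g. 1.0 if the word carries new information.
context = [0.0, 1.0, 0.0, 1.0]

# Second stage: regress the first-stage errors on the context feature.
residuals = [o - b for o, b in zip(observed, base_f0)]
slope, intercept = fit_line(context, residuals)
corrected = [b + slope * c + intercept for b, c in zip(base_f0, context)]

print(mean_abs_error(base_f0, observed), mean_abs_error(corrected, observed))
```

Fitting the second stage to residuals leaves the first set of rules untouched, so the same isolated-utterance rules can still be used where no dialog context is available.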

