Coordinating Speech and Actions for
Animated Pedagogical Agents

Jason L. Elliott

May 6, 1997







Table of Contents

  1. Introduction

  2. Architecture
    1. Interaction Manager
    2. Pedagogical Planner
    3. ExplanAgent Planner

  3. Coordinated Speech and Actions
    1. CLIPS
    2. Our Implementation

  4. Example Scenario

  5. Conclusions




I. Introduction

The United States has been experiencing a crisis in education. In international surveys, the U.S. falls at the bottom of the rankings in quality and effectiveness of education. This problem cannot be allowed to continue. One approach to solving the education crisis involves using computers to assist teachers in normal classroom education. Computer tutoring systems have been around for several decades, but the recent combination of entertainment and education software systems has markedly improved the effectiveness of intelligent "edutainment."

One of the key elements in edutainment software packages is the presence of an agent that is believable and entertaining. This animated agent is the "teacher" in the software package. It is used for both aural and visual communication of the subject material. However, the agent cannot be just a puppet on strings that performs certain actions. Artificial intelligence can be coupled with the agent's visual and aural aspects to create an animated pedagogical agent. This new breed of agents is capable of responding to students' needs on an individual basis. The AI techniques used can allow the agent to adapt and react differently for each student that it encounters.

Animated pedagogical agents provide the users with a higher level of interaction than in previous intelligent tutoring systems. Students will be more likely to learn the subject material if they feel some connection with the agent or teacher who is presenting the material. Since the connection to the agent is the major factor in the effectiveness of the system, the agent must exhibit the social qualities that humans generally expect from one another.

One of the main obstacles for an animated pedagogical agent is believability. Humans expect a certain degree of emotion from the people they come into contact with; in order for the agent to be significant in the learning process, it must display some of those reactions and behaviors that are normally present in human relationships. These characteristics can be accomplished by coordinating the agent's speech and actions. The things that the agent says and does must fit together in a way that appears normal to the student. Without the coordination of speech and action, the agent would appear to be pre-programmed and impersonal. The difficulty in providing these sequences of speech and action is that they must not only be seamless and "make sense," but they must actually instruct the student and be of some use to the subject being taught.

Together with James Lester, Stuart Towns, and the Intellimedia Initiative at North Carolina State University, I have worked to implement the coordination of speech and actions for animated pedagogical agents. The Intellimedia Initiative is currently implementing a design-centered learning environment that is inhabited by an animated pedagogical agent named Cosmo. The purpose of this environment, home of the Internet Protocol Advisor, is to teach college-level students how packets of information are transferred across the internet. By using the inference engine CLIPS, Stuart and I have created a set of rules that govern the utterances and behaviors of Cosmo. By coordinating Cosmo's speech and actions based on a user model, the problem state, and the input or lack of input from the user, we hope to create a believable and highly effective animated pedagogical agent that can provide educators with a useful teaching tool.

This paper continues by describing the basic architecture on which our knowledge-based learning environment is constructed. It then gives detailed information on the implementation of coordinated speech and actions for animated pedagogical agents. Following the implementation section, there is an example of how the Pedagogical Planner and the ExplanAgent Planner utilize our work to explain problems in the Internet Protocol Advisor environment. The paper concludes with some observations about knowledge-based learning environments.





II. Architecture

The Internet Protocol Advisor environment is based upon an architecture with three major components: the Interaction Manager, the Pedagogical Planner, and the ExplanAgent Planner. Each of these sub-systems plays a part in the user's experience. Figure 1 shows the complete architecture for this knowledge-based learning environment.

  1. Interaction Manager -- deals with user input and output
  2. Pedagogical Planner -- determines strategy for problem sequencing
  3. ExplanAgent Planner -- controls how the agent explains the current problem


Figure 1. Learning Environment Architecture


A. Interaction Manager

The Interaction Manager handles all input and output from the system. The Interaction Manager contains an action interpreter that categorizes user input and relays that information to the other parts of the environment. After the user input has been processed and all explanation and agent behaviors have been chosen, the presentation engine updates the environment using the selected media. The Interaction Manager makes use of libraries of agent behaviors, utterances, and interface options. Currently, the interface to the Interaction Manager utilizes only mouse and keyboard input, but future implementations could include joysticks, virtual reality equipment, or any other input device. Figure 4 shows the interface for the Internet Protocol Advisor.


B. Pedagogical Planner

The second component of the architecture is the Pedagogical Planner. This unit determines the characteristics of the problems that the environment presents to the user. By using a diagnostic system in conjunction with the user model, problem state, dialogue history, and other problem generation knowledge, the Pedagogical Planner can create a new problem that is appropriate for the user. When the user attempts to solve a problem, the diagnostic system compares the user's solution with the correct solution and determines what type of explanation is needed to help the user understand the problem. The user model contains information about the user's knowledge of the focus topic and any mistakes the user has made. The problem state provides the topic being emphasized in this problem, the characteristics of the user's selection, and the correct selection for this problem. The diagnostic system uses all of this information to choose exactly what type of explanation the user needs. The dialogue history keeps track of all explanations to decrease repetition in the agent's utterances and behaviors.
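
The three knowledge sources named above can be pictured as simple records. The following is only a minimal sketch in Python (not the CLIPS facts used in our implementation); every class, field, and method name is hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class UserModel:
    """Hypothetical user model: per-topic knowledge and mistake tallies."""
    knowledge: dict = field(default_factory=dict)
    mistakes: dict = field(default_factory=dict)


@dataclass
class ProblemState:
    """Hypothetical problem state: emphasized topic plus both selections."""
    main_topic: str
    user_selection: dict
    correct_selection: dict


@dataclass
class DialogueHistory:
    """Hypothetical dialogue history: remembers utterances already used."""
    used: set = field(default_factory=set)

    def pick_fresh(self, candidates):
        # Prefer an utterance that has not been spoken yet, to reduce repetition.
        for utterance in candidates:
            if utterance not in self.used:
                self.used.add(utterance)
                return utterance
        # Everything has been used already, so fall back to the first candidate.
        return candidates[0]
```

A history of this kind lets a selection rule ask for "any congratulatory remark not yet spoken" without knowing which remarks exist.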


C. ExplanAgent Planner

The final component of the environment architecture is the "ExplanAgent" Planner. Based on the information provided by the Pedagogical Planner, the ExplanAgent Planner determines the sequence of utterances and behaviors that the agent should perform in order to effectively explain the current situation to the user. This is accomplished by incorporating libraries of explanation knowledge, behavior knowledge, and domain knowledge into the explanation template provided by the Pedagogical Planner.

The focus of our work is on the second and third components of the system architecture. We worked to implement the diagnostic system and its necessary sub-components. In addition, we composed rules to generate the sequenced behaviors and utterances in the ExplanAgent Planner. In the ExplanAgent Planner, we have implemented an explanation template generator and an explanation instantiator. The template generator provides a sequence of utterance categories. That sequence is then used by the explanation instantiator to choose the specific audio clips and behaviors. Our current implementation deals with situations where an explanation is always required of the agent. In other words, our work is only used when the user makes a selection that is pertinent to the problem, as opposed to actions that do not affect the problem state (such as pushing buttons or moving the mouse).






III. Coordinated Speech and Actions

A. CLIPS

Our implementation of coordinated speech and actions for animated pedagogical agents uses the expert systems language CLIPS. CLIPS allows us to implement a rule-based program which controls and manages both the Pedagogical Planner and the ExplanAgent Planner. Rule-based systems contain three basic components:

  1. Fact List -- the specific data that represents the state of the environment
  2. Knowledge Base -- the rules which govern how the environment changes
  3. Inference Engine -- executes the rules based on the current data

The component of this system that we are most interested in is the set of rules in the knowledge base. These rules are in the form of IF...THEN statements that usually have multiple premises and multiple conclusions. One advantage of using rule-based systems instead of large IF...THEN statements is the ability to re-trigger the same rules simply by reasserting facts. In addition, the forward chaining capabilities of CLIPS allow future work on the system to incorporate rules which adapt and learn about the user during execution.
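
To make the three components concrete, here is a minimal sketch of forward chaining written in Python rather than CLIPS. It is a simplified stand-in, not our implementation: facts are plain strings, each rule is a pair of premise and conclusion sets, and the fact names are hypothetical. It does not model CLIPS pattern matching, salience, or retraction.

```python
def run(facts, rules):
    """Naive inference engine: fire rules until no rule adds a new fact."""
    facts = set(facts)  # the fact list
    changed = True
    while changed:
        changed = False
        for premises, conclusions in rules:
            # A rule fires when all premises hold and it still has something to add.
            if premises <= facts and not conclusions <= facts:
                facts |= conclusions
                changed = True
    return facts


# The knowledge base: IF all premises THEN assert all conclusions.
# These fact names are invented for illustration only.
RULES = [
    (frozenset({"subnet-wrong"}),
     frozenset({"explanation-type:misconception-1"})),
    (frozenset({"explanation-type:misconception-1"}),
     frozenset({"needs:cause", "needs:effect", "needs:rationale"})),
]

result = run({"subnet-wrong"}, RULES)
```

The second rule fires only because the first rule asserted its premise, which is the chaining behavior the inference engine provides.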


B. Our Implementation

The first component that we developed was the diagnostic system in the Pedagogical Planner. The purpose of the diagnostic system is to compare the user's solution with the correct solution and determine what type of explanation is needed. Our diagnostic system uses rules that separate the user's input into the various combinations of environment characteristics that are possible in the Internet Protocol Advisor.


Figure 2. Architecture for coordinating speech and actions


The system currently teaches four topics about information traveling over the internet: address resolution, differences between subnet types, computers versus routers, and congestion issues. Each problem in the environment has a main topic of focus and a secondary topic. The cases which the diagnostic system acknowledges and responds to are based on the possible combinations of user solutions and main problem topics. There are currently eleven possible scenarios, including the situation where the user has chosen the perfect answer and the situation where the user has chosen the worst possible answer. After determining which case the user's input falls under, the Pedagogical Planner updates the user model based on the solution given. For both the main topic and the secondary topic, correct answers increase the user's knowledge of that topic and incorrect answers are tallied. Once the user model has been updated, the diagnostic system provides the ExplanAgent Planner with a problem state and an explanation type based on the problem case.
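
The diagnosis step can be sketched as follows. This is a simplified Python illustration under stated assumptions, not our eleven CLIPS cases: it compares selections topic by topic, uses plain dictionaries for the user model, and derives the explanation type from the number of incorrect topics. Only the label "misconception 1" comes from the example scenario in Section IV; the function name, the dictionary layout, and the "all correct" label are hypothetical.

```python
def diagnose(user_selection, correct_selection, user_model):
    """Compare the user's solution with the correct one and update the user model."""
    correct = [t for t in correct_selection
               if user_selection.get(t) == correct_selection[t]]
    incorrect = [t for t in correct_selection
                 if user_selection.get(t) != correct_selection[t]]

    # Correct answers increase knowledge of the topic; incorrect answers are tallied.
    for topic in correct:
        user_model["knowledge"][topic] = user_model["knowledge"].get(topic, 0) + 1
    for topic in incorrect:
        user_model["mistakes"][topic] = user_model["mistakes"].get(topic, 0) + 1

    if not incorrect:
        explanation_type = "all correct"  # hypothetical label
    else:
        explanation_type = f"misconception {len(incorrect)}"

    problem_state = {"correct": correct, "incorrect": incorrect}
    return problem_state, explanation_type


# Hypothetical encoding of a selection with a good address on a slow, quiet subnet.
model = {"knowledge": {}, "mistakes": {"subnet type": 2}}
selection = {"address resolution": "good", "subnet type": "ethernet", "congestion": "low"}
best = {"address resolution": "good", "subnet type": "fiber", "congestion": "low"}
state, kind = diagnose(selection, best, model)
```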


Figure 3. Problem explanation tree


The ExplanAgent Planner uses the problem state and the explanation type to sequence the agent's speech and actions. The ExplanAgent Planner involves two sub-components, the explanation template generator and the explanation instantiator. The explanation template generator uses the explanation type provided by the diagnostic system to create a sequence template for the agent's behaviors. These templates do not select the actual behaviors. Rather, they provide general guidelines for the explanation instantiator to use. An example of an explanation template would be to give the user a cause and effect explanation of the main incorrect topic followed by background information and an encouraging remark. The benefit of using explanation templates is the ability to add new sequences without changing the explanation instantiator. There are currently nine categories of utterances that can be included in explanation templates: congratulatory remarks; statements of correct topics; background information; cause, effect, and rationale statements for incorrect choices (three separate categories); assistance remarks; positive linking phrases; and negative linking phrases. The explanation template generator uses these categories of utterances to create coherent and useful explanations for the user.
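
One way to picture the template generator is as a lookup from explanation type to a sequence of utterance categories. The sketch below is a Python illustration, not our CLIPS rules. The category identifiers and function name are hypothetical; the "misconception 1" sequence is the one used in the example scenario in Section IV.

```python
# The nine utterance categories, with invented identifiers.
CATEGORIES = {
    "congratulatory", "correct-topic", "background",
    "cause", "effect", "rationale",
    "assistance", "positive-link", "negative-link",
}

# Explanation type -> ordered sequence of categories (no concrete phrases yet).
TEMPLATES = {
    "misconception 1": [
        "correct-topic", "positive-link", "correct-topic", "congratulatory",
        "negative-link", "cause", "effect", "rationale",
        "background", "assistance",
    ],
}


def generate_template(explanation_type):
    """Return the category sequence for an explanation type."""
    template = TEMPLATES[explanation_type]
    unknown = [c for c in template if c not in CATEGORIES]
    if unknown:
        raise ValueError(f"unknown utterance categories: {unknown}")
    return list(template)
```

Adding a new sequence then means adding an entry to the table, which leaves the instantiator untouched.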

Once the explanation template generator has determined the template, the explanation instantiator uses that information to choose actual phrases and behaviors for the agent. Each category of utterances that can appear in the template is controlled by one rule in the knowledge base. When one of the category rules is fired, it uses the dialogue history and the libraries of behavior knowledge and explanation knowledge to select an appropriate utterance for the agent. For each category, different aspects of the problem state affect the selection of explanations. Congratulatory remarks are general and do not require any information from the problem state. Statements of correct topics access the problem state to determine which topic the user answered correctly. The background information deals with the main topic of the current problem. Background information also uses the user model to determine how much prior knowledge the user has on the main topic. If the user has exhibited extensive knowledge of the topic already, then the explanation instantiator assumes the mistake was careless and decides not to provide background information to the user. Cause, effect, and rationale statements can emphasize either the main topic or the secondary topic of the problem. Assistance remarks are the most complicated of the utterances. The assistance rule uses the user model to determine the number of mistakes the user has made on the chosen topic. It then decides whether to give the user a simple encouragement, a vague hint, or a detailed hint. The linking phrases are categorized as either positive or negative, where positive links (e.g. "and," "also") join phrases that agree and negative links (e.g. "but," "however") join phrases that contrast.
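
The two rules that consult the user model can be sketched as small decision functions. This is a Python illustration only; the function names and all numeric thresholds are assumptions. The thresholds were picked so that a first mistake yields encouragement, a third yields a vague hint, and a fourth yields a detailed hint, which matches the example scenario in Section IV.

```python
def assistance_level(mistakes_on_topic, vague_after=2, detailed_after=4):
    """Choose the kind of assistance remark from the mistake tally.

    mistakes_on_topic counts the current mistake as well as earlier ones.
    The thresholds are assumed values, not taken from the implementation.
    """
    if mistakes_on_topic >= detailed_after:
        return "detailed hint"
    if mistakes_on_topic >= vague_after:
        return "vague hint"
    return "encouragement"


def wants_background(knowledge_of_topic, extensive=3):
    """Skip background information when the user already knows the topic well.

    The cutoff for "extensive" knowledge is an assumed value.
    """
    return knowledge_of_topic < extensive
```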

Each category may incorporate agent actions along with the chosen utterances. These actions are based on the length and focus of the utterances. If the agent's explanation is short and deals with a specific object in the environment, the action chosen will cause the agent to point at the focus object. However, if the explanation does not focus on a particular object, the agent may perform a general motion such as waving its hands in the air. For very short utterances, the agent performs quick "twitches."
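
The action choice described above depends on two inputs: the length of the utterance and whether it focuses on an object. The following Python sketch is an illustration under assumptions; the function name and the word-count cutoffs are hypothetical, and category-specific actions such as clapping are not modeled.

```python
def choose_action(utterance, focus_object=None, very_short=1, short=8):
    """Pick an accompanying action from utterance length and focus.

    The word-count cutoffs are assumed values chosen for illustration.
    """
    words = len(utterance.split())
    if words <= very_short:
        return "twitch"                     # very short utterance
    if focus_object is not None and words <= short:
        return f"point to {focus_object}"   # short and about a specific object
    return "hand wave"                      # general motion otherwise
```

For example, `choose_action("This subnet has low traffic", "subnet")` selects a pointing action, while the same call without a focus object selects a general hand wave.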






IV. Example Scenario

The four topics covered in the Internet Protocol Advisor are address resolution, subnet types, computers versus routers, and congestion issues. In this scenario, the past two problems the user has been given dealt mainly with subnet type, and the user answered both of those problems incorrectly. The user is now presented with a problem where the main topic is subnet type and the secondary topic is address resolution. The user selects a computer with a good address but on a slow subnet (ethernet) with low congestion. This is the third time the user has incorrectly chosen a slow subnet. We will assume that the user has no previous knowledge about any of the topics and the only mistakes the user has made are the two mistakes dealing with subnet types.


Figure 4. Cosmo from the Internet Protocol Advisor


First, the user's input is reported to the diagnostic system in the Pedagogical Planner. The diagnostic system determines that this solution falls in Case 3 and therefore asserts a problem state with two correct topics, congestion and address resolution, and one incorrect topic (which is the main topic of the problem), subnet type. The Pedagogical Planner updates the user model to reflect an increase in the user's knowledge of address resolution and congestion, and yet another mistake on subnet types. The diagnostic system then sets the explanation type to "misconception 1," meaning that the user has a misconception about one topic.

Once the diagnostic system has calculated the problem state and the explanation type, the ExplanAgent Planner begins determining the actual explanation that will be presented. The explanation template generator uses the explanation type of "misconception 1" to create the following explanation template: state correct topic 1, positive link, state correct topic 2, congratulatory remark, negative link, cause, effect, and rationale for the incorrect topic, background information, and assistance remark. This template is then passed to the explanation instantiator where all the individual behaviors in the template are calculated.

First, Cosmo will tell the student which parts of their choice were correct. To do this, the explanation instantiator uses the problem state to get the correct topics, checks to make sure the agent is capable of performing behaviors that deal with those topics, and then chooses which phrases and behaviors relate to the correct topics. In this case, the student was correct on the topics of congestion and address resolution, and so the explanation instantiator chooses the following phrases: "this subnet has low traffic" and "you chose a reasonable address." To accompany these phrases, the instantiator chooses a pointing action and a general hand waving action. Then, congratulatory remarks and linking phrases are added. These remarks are independent of the problem state and the user model, so the explanation instantiator simply chooses one from each category. So after choosing the statements of correct topics, congratulatory remarks, and linking phrases, the explanation instantiator has created the phrase and actions: "[point to subnet] This subnet has low traffic, [twitch] and [hand wave] you chose a reasonable address. [clap] Good job! [shrug shoulders] However..."

Next, Cosmo needs to tell the student about the mistakes that were made. Therefore, the next three categories in the explanation template are cause, effect, and rationale for the incorrect topic. The cause statement is more general than the other two and is used to point out the option the student chose. The effect and rationale use the problem state to determine the incorrect topic and then offer insight into how and why another solution would be better. In this case, since the incorrect topic is subnet types, the rationale will address the actual incorrect subnet type. These rules in the instantiator select the phrase: "Suppose we sent the packet here, what will happen? It might get there, but it will take a long time because optical fiber is much faster than ethernet." The next category in the explanation template is background information. In addition to getting the incorrect topic from the problem state, the background rule accesses the user model to determine how much knowledge the user has about the incorrect topic. Since the user apparently has no knowledge about subnet types, the explanation instantiator will proceed to give the user background information about subnet types. The phrase it selects is: "An ethernet is a type of subnet." This phrase is not intended to affect the user's next choice directly, but is an attempt to give the user some general information about the incorrect topic. For these phrases, general hand waving motions are selected.

Finally, Cosmo wants to assist the student in making a better choice. The assistance remark uses the problem state and the user model to determine the incorrect topic and any knowledge or mistakes the user may have exhibited. Because this is the third time the user has incorrectly chosen a slow subnet, the assistance rule selects a vague hint to help the user understand the differences in subnet types. The vague hint that is chosen is: "We want to concentrate on picking the fastest subnet." If no mistakes had been made in the past on subnet type, an encouraging statement such as "Try again" would have been chosen. If the user incorrectly chooses the subnet type in the future, the assistance rule will produce a detailed hint such as "We want to choose a fiber optic subnet instead of an ethernet subnet." Since the last portion of the explanation is long and informational, the actions chosen are hand motions similar to the hand wave for the second phrase. These actions are added so the agent will not be still for this period of the explanation.

After all of the explanation instantiator's rules have fired, the ExplanAgent Planner places all the selected phrases and behaviors into the behavior array that is passed to the Interaction Manager. The Interaction Manager's presentation engine then displays the chosen sequence of coordinated speech and actions to the user. So after the user chooses a computer with a good address on an ethernet subnet that has low congestion, Cosmo will offer the following explanation:

"[point to subnet] This subnet has low traffic, [twitch] and [hand wave] you chose a reasonable address. [clap] Good job! [shrug shoulders] However, [hand motion] suppose we sent the packet here, what will happen? It might get there, but it will take a long time because optical fiber is much faster than ethernet. [hand motion] An ethernet is a type of subnet. We want to concentrate on picking the fastest subnet."

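The behavior array can be pictured as an ordered list of action and utterance pairs. The Python sketch below is an illustration only; the pair representation and the function name are assumptions, and it renders just the opening of the explanation above in the same bracketed notation.

```python
# Hypothetical behavior array: (action, utterance) pairs in presentation order.
behaviors = [
    ("point to subnet", "This subnet has low traffic,"),
    ("twitch", "and"),
    ("hand wave", "you chose a reasonable address."),
    ("clap", "Good job!"),
    ("shrug shoulders", "However,"),
]


def render(behavior_array):
    """Write each action in brackets immediately before its utterance."""
    return " ".join(f"[{action}] {utterance}" for action, utterance in behavior_array)


script = render(behaviors)
```

Keeping each action paired with its utterance is what allows the presentation step to start the gesture and the audio clip together.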





V. Conclusions

Coordinating speech and actions for animated pedagogical agents in knowledge-based learning environments increases the believability of such agents. Our rule-based implementation of this coordination process produces coherent and useful sequences of utterances and behaviors that enhance the educational capabilities of the learning environment. Animated agents that utilize these believable sequences of behaviors are more personable and flexible. Our implementation uses a diagnostic system to determine what type of explanation the user requires and the ExplanAgent Planner to select the specific behaviors the agent performs. The implementation allows for future development through the addition of more problem cases, explanation types, explanation templates, and utterance categories.

Animated agents offer a great opportunity to the world of educational software. They still have many hurdles to overcome, such as believability and adaptivity. Once they have matured, however, they will be a great asset to the educational process. Eventually, knowledge-based learning environments inhabited by animated pedagogical agents will become effective learning tools for our society.