Coordinating Speech and Actions for
Animated Pedagogical Agents
Jason L. Elliott
May 6, 1997
Table of Contents
- Introduction
- Architecture
  - Interaction Manager
  - Pedagogical Planner
  - ExplanAgent Planner
- Coordinated Speech and Actions
  - CLIPS
  - Our Implementation
- Example Scenario
- Conclusions
I. Introduction
The United States has been experiencing a crisis in education. In
international surveys, the U.S. falls at the bottom of the rankings in quality
and effectiveness of education. This problem cannot be allowed to continue.
One approach to solving the education crisis involves using computers to assist
teachers in normal classroom education. Computer tutoring systems have been
around for several decades, but the recent combination of entertainment and
education software systems has markedly improved the effectiveness of
intelligent "edutainment."
One of the key elements in edutainment software packages is the presence
of an agent that is believable and entertaining. This animated agent is the
"teacher" in the software package. It is used for both aural and visual
communication of the subject material. However, the agent cannot be just a
puppet on strings that performs certain actions. Artificial intelligence can
be coupled with the agent's visual and aural aspects to create an animated
pedagogical agent. This new breed of agents is capable of responding to
students' needs on an individual basis. The AI techniques used can allow the
agent to adapt and react differently for each student that it encounters.
Animated pedagogical agents provide the users with a higher level of
interaction than in previous intelligent tutoring systems. Students will be
more likely to learn the subject material if they feel some connection with the
agent or teacher who is presenting the material. Since the connection to the
agent is the major factor in the effectiveness of the system, the agent must
exhibit the social qualities that humans generally expect from one another.
One of the main obstacles for an animated pedagogical agent is
believability. Humans expect a certain degree of emotion from the people they
come into contact with; in order for the agent to be significant in the learning
process, it must display some of those reactions and behaviors that are normally
present in human relationships. These characteristics can be accomplished by
coordinating the agent's speech and actions. The things that the agent says and
does must fit together in a way that appears normal to the student. Without the
coordination of speech and action, the agent would appear to be pre-programmed
and impersonal. The difficulty in providing these sequences of speech and
action is that they must not only be seamless and "make sense," but they must
actually instruct the student and be of some use to the subject being taught.
Together with James Lester, Stuart Towns, and the Intellimedia Initiative at
North Carolina State University, I have worked to implement the coordination of
speech and actions for animated pedagogical agents. The Intellimedia Initiative
is currently implementing a design-centered learning environment that is
inhabited by an animated pedagogical agent named Cosmo. The purpose of this
environment, home of the Internet Protocol Advisor, is to teach college-level
students how packets of information are transferred across the
internet. By using the inference engine CLIPS, Stuart and I have created a set
of rules that govern the utterances and behaviors of Cosmo. By coordinating
Cosmo's speech and actions based on a user model, the problem state, and the
input or lack of input from the user, we hope to create a believable and highly
effective animated pedagogical agent that can provide educators with a useful
teaching tool.
This paper continues by describing the basic architecture by which our
knowledge-based learning environment is constructed. It then proceeds to give
detailed information on the implementation of coordinated speech and actions for
animated pedagogical agents. Following the implementation section, there is an
example of how the Pedagogical Planner and the ExplanAgent Planner use our
work to explain problems in the Internet Protocol Advisor environment. The
paper concludes with some observations about knowledge-based learning
environments.
II. Architecture
The Internet Protocol Advisor environment is based upon an architecture
that contains three major components. These three components are: the
Interaction Manager, the Pedagogical Planner, and the ExplanAgent Planner. Each
of these sub-systems plays a part in the user's experience. Figure 1 shows the
complete architecture for this knowledge-based learning environment.
- Interaction Manager -- deals with user input and output
- Pedagogical Planner -- determines strategy for problem sequencing
- ExplanAgent Planner -- controls how the agent explains the current problem
Figure 1. Learning Environment Architecture
A. Interaction Manager
The Interaction Manager handles all input to and output from the system.
The Interaction Manager contains an action interpreter that categorizes user
input and relays that information to the other parts of the environment. After
the user input has been processed and all explanation and agent behaviors have
been chosen, the presentation engine updates the environment using the selected
media. The Interaction Manager makes use of libraries of agent behaviors,
utterances, and interface options. Currently, the interface to the Interaction
Manager utilizes only mouse and keyboard input, but future implementations could
include joysticks, virtual reality equipment, or any other input device.
Figure 4 shows the interface for the Internet Protocol Advisor.
B. Pedagogical Planner
The second component of the architecture is the Pedagogical Planner.
This unit determines the characteristics of the problems that the environment
presents to the user. By using a diagnostic system in conjunction with the user
model, problem state, dialogue history, and other problem generation knowledge,
the Pedagogical Planner can create a new problem that is appropriate for the
user. When the user attempts to solve a problem, the diagnostic system compares
the user's solution with the correct solution and determines what type of
explanation is needed to help the user understand the problem. Contained in the
user model is information about the user's knowledge of the focus topic and any
mistakes the user has made. The problem state provides the topic being
emphasized in this problem, the characteristics of the user's selection, and the
correct selection for this problem. The diagnostic system uses all of this
information to choose exactly what type of explanation the user needs. All the
explanations are kept track of by the dialogue history to decrease repetition in
the agent's utterances and behaviors.
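The bookkeeping described above can be pictured with a small sketch. It is
illustrative Python, not the CLIPS facts the system uses, and the names
`UserModel`, `DialogueHistory`, and `pick`, the field layout, and the
alternative remark "Well done!" are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class UserModel:
    """Per-topic knowledge and mistake tallies (field names are assumptions)."""
    knowledge: dict = field(default_factory=dict)  # topic -> correct answers seen
    mistakes: dict = field(default_factory=dict)   # topic -> incorrect answers seen


@dataclass
class DialogueHistory:
    """Remembers used utterances so the agent repeats itself as little as possible."""
    used: list = field(default_factory=list)

    def pick(self, candidates):
        # Prefer the first candidate that has not been spoken yet.
        for utterance in candidates:
            if utterance not in self.used:
                self.used.append(utterance)
                return utterance
        # Everything has been used: fall back to the least recently used one.
        choice = min(candidates, key=self.used.index)
        self.used.remove(choice)
        self.used.append(choice)
        return choice


history = DialogueHistory()
remarks = ["Good job!", "Well done!"]  # "Well done!" is a made-up alternative
spoken = [history.pick(remarks) for _ in range(3)]
```

Here the third request has no fresh remark left, so the sketch falls back to
the one used longest ago; the real dialogue history may resolve this
differently.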
C. ExplanAgent Planner
The final component of the environment architecture is the "ExplanAgent"
Planner. Based on the information provided by the Pedagogical Planner, the
ExplanAgent Planner determines the sequence of utterances and behaviors that the
agent should perform in order to effectively explain the current situation to
the user. This is accomplished by incorporating libraries of explanation
knowledge, behavior knowledge, and domain knowledge into the explanation
template provided by the Pedagogical Planner.
The focus of our work is on the second and third components of the
system architecture. We worked to implement the diagnostic system and its
necessary sub-components. In addition, we composed rules to generate the
sequenced behaviors and utterances in the ExplanAgent Planner. In the
ExplanAgent Planner, we have implemented an explanation template generator and
an explanation instantiator. The template generator provides a sequence of
utterance categories. That sequence is then used by the explanation
instantiator to choose the specific audio clips and behaviors. Our current
implementation deals with situations where an explanation is always required of
the agent. In other words, our work is only used when the user makes a
selection that is pertinent to the problem as opposed to actions that don't
affect the problem state (such as pushing buttons or moving the mouse).
III. Coordinated Speech and Actions
A. CLIPS
Our implementation of coordinated speech and actions for animated
pedagogical agents uses the expert systems language CLIPS. CLIPS allows us to
implement a rule-based program which controls and manages both the Pedagogical
Planner and the ExplanAgent Planner. Rule-based systems contain three basic
components:
- Fact List -- the specific data that represents the state of the environment
- Knowledge Base -- the rules which govern how the environment changes
- Inference Engine -- executes the rules based on the current data
The component of this system that we are most interested in is the set
of rules in the knowledge base. These rules are in the form of IF...THEN
statements that usually have multiple premises and multiple conclusions. One
advantage of using rule-based systems instead of large IF...THEN statements is
the ability to re-trigger the same rules simply by reasserting facts. In
addition, the forward-chaining capabilities of CLIPS allow future work on
the system to incorporate rules which adapt and learn about the user during
execution.
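To make the three components concrete, here is a toy forward-chaining loop.
It is a Python sketch, not CLIPS: facts are plain tuples with no pattern
variables, and `Engine`, `assert_fact`, `retract`, and the example rule name
are hypothetical. It imitates one CLIPS behavior described above: a rule fires
once for a given set of matching facts, so retracting a fact and asserting it
again lets the same rule fire again.

```python
from itertools import count


class Engine:
    def __init__(self):
        self.facts = {}      # fact list: fact id -> fact tuple
        self.rules = []      # knowledge base: (name, premises, action)
        self.fired = set()   # (rule name, matched fact ids) already executed
        self.log = []        # names of rules in the order they fired
        self._ids = count(1)

    def assert_fact(self, fact):
        fact_id = next(self._ids)  # every assertion gets a fresh id
        self.facts[fact_id] = fact
        return fact_id

    def retract(self, fact_id):
        self.facts.pop(fact_id, None)

    def add_rule(self, name, premises, action):
        self.rules.append((name, tuple(premises), action))

    def _match(self, premises):
        # Return the ids of facts satisfying every premise, or None.
        ids = []
        for premise in premises:
            hit = next((i for i, f in self.facts.items() if f == premise), None)
            if hit is None:
                return None
            ids.append(hit)
        return tuple(ids)

    def run(self):
        # Inference engine: keep firing rules until nothing new can fire.
        progress = True
        while progress:
            progress = False
            for name, premises, action in self.rules:
                ids = self._match(premises)
                if ids is None or (name, ids) in self.fired:
                    continue
                self.fired.add((name, ids))
                self.log.append(name)
                action(self)
                progress = True


engine = Engine()
engine.add_rule("request-explanation", [("selection", "made")],
                lambda e: e.assert_fact(("explanation", "needed")))
first = engine.assert_fact(("selection", "made"))
engine.run()                                # the rule fires once
engine.retract(first)
engine.assert_fact(("selection", "made"))   # reasserting re-triggers the rule
engine.run()
```

After both runs, `engine.log` holds the rule name twice: once for each
assertion of the selection fact.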
B. Our Implementation
The first component that we developed was the diagnostic system in the
Pedagogical Planner. The purpose of the diagnostic system is to compare the
user's solution with the correct solution and determine what type of explanation
is needed. Our diagnostic system uses rules that separate the user's input
into the various combinations of environment characteristics that are possible
in the Internet Protocol Advisor.
Figure 2. Architecture for coordinating speech and actions
The system currently teaches four topics about information traveling over the
internet: address resolution, differences between subnet types, computers
versus routers, and congestion issues. Each problem in the environment has a
main topic of focus and a secondary topic. The cases which the diagnostic
system acknowledges and responds to are based on the possible combinations of
user solutions and main problem topics. There are currently eleven possible
scenarios, including the situation where the user has chosen the perfect answer
and the situation where the user has chosen the worst possible answer. After
determining which case the user's input falls under, the Pedagogical Planner
updates the user model based on the solution given. For both the main topic and
the secondary topic, correct answers increase the user's knowledge of that topic
and incorrect answers are tallied. Once the user model has been updated, the
diagnostic system provides the ExplanAgent Planner with a problem state and an
explanation type based on the problem case.
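The diagnosis step can be sketched as a topic-by-topic comparison. This is a
simplified Python illustration, not the CLIPS rules: it does not model the
eleven cases, it updates every compared topic rather than only the main and
secondary ones, and the function name `diagnose`, the dictionary layout, and
the type name "all correct" are assumptions. Only "misconception 1" is a type
name taken from the example scenario in Section IV.

```python
from collections import Counter


def diagnose(selection, correct, knowledge, mistakes):
    """Compare the user's selection with the correct one, topic by topic."""
    right = [t for t in correct if selection.get(t) == correct[t]]
    wrong = [t for t in correct if selection.get(t) != correct[t]]
    # Update the user model: knowledge grows, mistakes are tallied.
    for topic in right:
        knowledge[topic] += 1
    for topic in wrong:
        mistakes[topic] += 1
    # Assumed naming scheme extending the one type the paper mentions.
    explanation_type = f"misconception {len(wrong)}" if wrong else "all correct"
    problem_state = {"correct": right, "incorrect": wrong}
    return problem_state, explanation_type


knowledge = Counter()
mistakes = Counter({"subnet type": 2})  # two earlier subnet mistakes
correct = {"congestion": "low", "address resolution": "good",
           "subnet type": "optical fiber"}
selection = {"congestion": "low", "address resolution": "good",
             "subnet type": "ethernet"}
state, kind = diagnose(selection, correct, knowledge, mistakes)
```

With these inputs the sketch reports two correct topics, one incorrect topic,
and the explanation type "misconception 1".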
Figure 3. Problem explanation tree
The ExplanAgent Planner uses the problem state and the explanation type
to sequence the agent's speech and actions. The explanation planner involves
two sub-components, the explanation template generator and the explanation
instantiator. The explanation template generator uses the explanation type
provided by the diagnostic system to create a sequence template for the agent's
behaviors. These templates do not select the actual behaviors. Rather, they
simply create a template that provides general guidelines for the explanation
instantiator to use. An example of an explanation template would be to give the
user a cause and effect explanation of the main incorrect topic followed by
background information and an encouraging remark. The benefit of using
explanation templates is the ability to add new sequences without changing the
explanation instantiator. There are currently nine categories of utterances
that can be included in explanation templates. They are: congratulatory
remarks; statements of correct topics; background information; cause
statements, effect statements, and rationale statements for incorrect choices
(three separate categories); assistance remarks; positive linking phrases; and
negative linking phrases. The explanation template generator uses
these categories of utterances to create coherent and useful explanations for
the user.
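One way to picture the template generator is as a lookup table from
explanation types to sequences of utterance categories. The sketch below is
Python, not the CLIPS rules; the names `Category`, `TEMPLATES`, and
`generate_template` are hypothetical, while the sequence shown for
"misconception 1" is the one used in the example scenario in Section IV.

```python
from enum import Enum


class Category(Enum):
    """The nine utterance categories that a template may contain."""
    CONGRATULATE = "congratulatory remark"
    STATE_CORRECT = "statement of correct topic"
    BACKGROUND = "background information"
    CAUSE = "cause"
    EFFECT = "effect"
    RATIONALE = "rationale"
    ASSIST = "assistance remark"
    POSITIVE_LINK = "positive linking phrase"
    NEGATIVE_LINK = "negative linking phrase"


C = Category
TEMPLATES = {
    "misconception 1": [
        C.STATE_CORRECT, C.POSITIVE_LINK, C.STATE_CORRECT, C.CONGRATULATE,
        C.NEGATIVE_LINK, C.CAUSE, C.EFFECT, C.RATIONALE, C.BACKGROUND, C.ASSIST,
    ],
}


def generate_template(explanation_type):
    # Return a copy so callers cannot alter the stored sequence.
    return list(TEMPLATES[explanation_type])


template = generate_template("misconception 1")
```

Adding a new sequence means adding one entry to the table; nothing that
consumes the template has to change, which is the benefit noted above.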
Once the explanation template generator has determined the template, the
explanation instantiator uses that information to choose actual phrases and
behaviors for the agent. Each category of utterances that can appear in the
template is controlled by one rule in the knowledge base. When one of the
category rules is fired, it uses the dialogue history and the libraries of
behavior knowledge and explanation knowledge to select an appropriate utterance
for the agent. For each category, different aspects of the problem state affect
the calculation of explanations. Congratulatory remarks are general and do not
require any information from the problem state. Statements of correct topics
access the problem state to determine which topic the user answered correctly.
The background information deals with the main topic of the current problem.
Background information also uses the user model to determine how much prior
knowledge the user has on the main topic. If the user has exhibited extensive
knowledge of the topic already, then the explanation instantiator assumes the
mistake was careless and decides not to provide background information to the
user. Cause, effect, and rationale statements can either emphasize the main
topic of the problem or the secondary topic of the problem. Assistance remarks
are the most complicated of the utterances. The assistance rule uses the user
model to determine the number of mistakes the user has made on the chosen topic.
It then decides whether to give the user a simple encouragement, a vague hint,
or a detailed hint. The linking phrases are categorized as either positive or
negative: positive links join phrases that agree (e.g., and, also), and
negative links join phrases that contrast (e.g., but, however).
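The one-rule-per-category organization can be sketched as a table of small
functions that the instantiator walks in template order. This is illustrative
Python, not the CLIPS knowledge base: the function names, the `RULES` table,
and the dictionary-based problem state are assumptions, and the dialogue
history is left out for brevity. The phrases are quoted from the example
scenario in Section IV.

```python
# Phrases quoted from the example scenario; the lookup table itself is assumed.
CORRECT_PHRASES = {
    "congestion": "This subnet has low traffic,",
    "address resolution": "you chose a reasonable address.",
}


def state_correct(state, model):
    # Uses the problem state: report the next topic the user got right.
    return CORRECT_PHRASES[state["correct"].pop(0)]


def congratulate(state, model):
    # General remark: needs nothing from the problem state.
    return "Good job!"


RULES = {  # one rule per utterance category
    "state correct topic": state_correct,
    "congratulatory remark": congratulate,
    "positive link": lambda state, model: "and",
    "negative link": lambda state, model: "However...",
}


def instantiate(template, state, model):
    # Work on a copy so the caller's problem state is left untouched.
    state = {key: list(value) for key, value in state.items()}
    return [RULES[category](state, model) for category in template]


template = ["state correct topic", "positive link", "state correct topic",
            "congratulatory remark", "negative link"]
state = {"correct": ["congestion", "address resolution"],
         "incorrect": ["subnet type"]}
phrases = instantiate(template, state, model={})
```

Only four of the nine categories are shown; the remaining rules would slot
into the same table.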
Each category may incorporate agent actions along with the chosen
utterances. These actions are based on the length and focus of the utterances.
If the agent's explanation is short and deals with a specific object in the
environment, the action chosen will cause the agent to point at the focus
object. However, if the explanation does not focus on a particular object, the
agent may perform a general motion such as waving its hands in the air. For
very short utterances, the agent performs quick "twitches."
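The action choice described above reduces to a few checks on utterance length
and focus. In this Python sketch the word-count thresholds, the function name
`choose_action`, and the treatment of long utterances that do have a focus
object (a hand wave) are assumptions; the paper does not give exact limits.

```python
def choose_action(utterance, focus_object=None, very_short=2, short=8):
    """Pick an accompanying action from the utterance's length and focus."""
    words = len(utterance.split())
    if words <= very_short:
        return "twitch"                      # very short utterance
    if focus_object is not None and words <= short:
        return f"point to {focus_object}"    # short and about a specific object
    return "hand wave"                       # general motion otherwise


link_action = choose_action("and")
subnet_action = choose_action("This subnet has low traffic", focus_object="subnet")
address_action = choose_action("you chose a reasonable address")
```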
IV. Example Scenario
The four topics covered in the Internet Protocol Advisor are address
resolution, subnet types, computers versus routers, and congestion issues. In
this scenario, the past two problems the user has been given dealt mainly with
subnet type, and the user answered both of those problems incorrectly. The user
is now presented with a problem where the main topic is subnet type and the
secondary topic is address resolution. The user selects a computer with a good
address but on a slow subnet (ethernet) with low congestion. This is the third
time the user has incorrectly chosen a slow subnet. We will assume that the
user has no previous knowledge about any of the topics and the only mistakes
the user has made are the two mistakes dealing with subnet types.
Figure 4. Cosmo from the Internet Protocol Advisor
First, the user's input is reported to the diagnostic system in the
Pedagogical Planner. The diagnostic system determines that this solution falls
in Case 3 and therefore asserts a problem state with two correct topics,
congestion and address resolution, and one incorrect topic (which is the main
topic of the problem), subnet type. The Pedagogical Planner updates the user
model to reflect an increase in the user's knowledge of address resolution and
congestion, and yet another mistake on subnet types. The diagnostic system then
sets the explanation type to "misconception 1," meaning that the user has a
misconception about one topic.
Once the diagnostic system has calculated the problem state and the
explanation type, the ExplanAgent Planner begins determining the actual
explanation that will be presented. The explanation template generator uses the
explanation type of "misconception 1" to create the following explanation
template: state correct topic 1, positive link, state correct topic 2,
congratulatory remark, negative link, cause, effect, and rationale for the
incorrect topic, background information, and assistance remark. This template
is then passed to the explanation instantiator where all the individual
behaviors in the template are calculated.
First, Cosmo will tell the student which parts of their choice were
correct. To do this, the explanation instantiator uses the problem state to get
the correct topics, checks to make sure the agent is capable of performing
behaviors that deal with those topics, and then chooses which phrases and
behaviors relate to the correct topics. In this case, the student was correct
on the topics of congestion and address resolution, and so the explanation
instantiator chooses the following phrases: "this subnet has low traffic" and
"you chose a reasonable address." To accompany these phrases, the instantiator
chooses a pointing action and a general hand waving action. Then,
congratulatory remarks and linking phrases are added. These remarks are
independent of the problem state and the user model, so the explanation
instantiator simply chooses one from each category. So after choosing the
statements of correct topics, congratulatory remarks, and linking phrases, the
explanation instantiator has created the phrase and actions: "[point to subnet]
This subnet has low traffic, [twitch] and [hand wave] you chose a reasonable
address. [clap] Good job! [shrug shoulders] However..."
Next, Cosmo needs to tell the student about the mistakes that were
made. Therefore, the next four categories in the explanation template are
cause, effect, rationale, and background for the incorrect topic. The cause
statement is more general than the other three and is used to point out the
option the student chose. The effect and rationale use the problem state to
determine the incorrect topic and then offer insight into how and why another
solution would be better. In this case, since the incorrect topic is subnet
types, the rationale will address the actual incorrect subnet type. These rules
in the instantiator select the phrase: "Suppose we sent the packet here, what
will happen? It might get there, but it will take a long time because optical
fiber is much faster than ethernet." The next category in the explanation
template is background information. In addition to getting the incorrect topic
from the problem state, the background rule accesses the user model to
determine how much knowledge the user has about the incorrect topic. Since the
user apparently has no knowledge about subnet types, the explanation
instantiator will proceed to give the user background information about subnet
types. The phrase it selects is: "An ethernet is a type of subnet." This
phrase is not intended to affect the user's next choice directly, but is an
attempt to give the user some general information about the incorrect topic.
For these phrases, general hand waving motions are selected.
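The background decision can be sketched as a single check against the user
model. In this Python illustration the threshold for "extensive knowledge"
(three correct answers), the function name `background_remark`, and the lookup
table are assumptions; only the quoted phrase comes from the scenario.

```python
# Phrase quoted from the scenario; the table and threshold are assumed.
BACKGROUND = {"subnet type": "An ethernet is a type of subnet."}


def background_remark(topic, knowledge, extensive=3):
    """Return background information, or None if the user already knows the topic."""
    if knowledge.get(topic, 0) >= extensive:
        return None  # treat the mistake as careless and skip the background
    return BACKGROUND.get(topic)


novice = background_remark("subnet type", knowledge={})
expert = background_remark("subnet type", knowledge={"subnet type": 5})
```

For the user in this scenario, who has shown no knowledge of subnet types, the
sketch returns the background phrase; for a user with a long record of correct
answers it returns nothing.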
Finally, Cosmo wants to assist the student in making a better choice. The
assistance remark uses the problem state and the user model to determine the
incorrect topic and any knowledge or mistakes the user may have exhibited.
Because this is the third time the user has incorrectly chosen a slow subnet,
the assistance rule selects a vague hint to help the user understand the
differences in subnet types. The vague hint that is chosen is: "We want to
concentrate on picking the fastest subnet." If no mistakes had been made in the
past on subnet type, an encouraging statement such as "Try again" would have
been chosen. If the user incorrectly chooses the subnet type in the future, the
assistance rule will produce a detailed hint such as "We want to choose a fiber
optic subnet instead of an ethernet subnet." Since the last portion of the
explanation is long and informational, the actions chosen are hand motions
similar to the hand wave for the second phrase. These actions are added so the
agent will not be still for this period of the explanation.
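The escalation from encouragement to hints can be sketched as a function of
the number of earlier mistakes on the topic. This is illustrative Python, not
the CLIPS assistance rule. The boundaries are chosen to agree with the
scenario (no earlier mistakes gives encouragement, two give a vague hint,
three give a detailed hint); the case of exactly one earlier mistake is not
stated in the text and is an assumption here, as are the names
`assistance_level` and `HINTS`.

```python
# Hints quoted from the scenario; the table layout is assumed.
HINTS = {
    "encouragement": "Try again",
    "vague hint": "We want to concentrate on picking the fastest subnet.",
    "detailed hint": ("We want to choose a fiber optic subnet instead of "
                      "an ethernet subnet."),
}


def assistance_level(prior_mistakes):
    """Map earlier mistakes on the topic to a kind of assistance."""
    if prior_mistakes == 0:
        return "encouragement"
    if prior_mistakes <= 2:   # treating one earlier mistake as vague is assumed
        return "vague hint"
    return "detailed hint"


current_remark = HINTS[assistance_level(2)]  # third mistake in the scenario
next_remark = HINTS[assistance_level(3)]     # a further mistake later on
```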
After all of the explanation instantiator's rules have fired, the
ExplanAgent Planner assigns all the selected phrases and behaviors into the
behavior array that is passed to the Interaction Manager. The presentation
engine then displays the chosen sequence of coordinated speech and actions to
the user. So after choosing a computer with a good address on an ethernet
subnet that has low congestion, Cosmo will offer the following explanation to
the user:
"[point to subnet] This subnet has low traffic, [twitch] and [hand wave]
you chose a reasonable address. [clap] Good job! [shrug shoulders] However,
[hand motion] suppose we sent the packet here, what will happen? It might get
there, but it will take a long time because optical fiber is much faster than
ethernet. [hand motion] An ethernet is a type of subnet. We want to
concentrate on picking the fastest subnet."
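The behavior array behind this explanation can be pictured as an ordered list
of action and phrase pairs. The Python sketch below is only one possible
representation: the pair layout and the `render` function are assumptions,
while the actions and phrases are copied from the explanation above.

```python
# Each entry pairs an agent action with the phrase spoken alongside it.
behaviors = [
    ("point to subnet", "This subnet has low traffic,"),
    ("twitch", "and"),
    ("hand wave", "you chose a reasonable address."),
    ("clap", "Good job!"),
    ("shrug shoulders", "However,"),
    ("hand motion", "suppose we sent the packet here, what will happen? "
                    "It might get there, but it will take a long time because "
                    "optical fiber is much faster than ethernet."),
    ("hand motion", "An ethernet is a type of subnet. We want to concentrate "
                    "on picking the fastest subnet."),
]


def render(behaviors):
    """Interleave actions and phrases in presentation order."""
    return " ".join(f"[{action}] {phrase}" for action, phrase in behaviors)


transcript = render(behaviors)
```

Rendering the list reproduces the bracketed transcript shown above, with each
action placed immediately before the phrase it accompanies.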
V. Conclusions
Coordinating speech and actions for animated pedagogical agents in
knowledge-based learning environments increases the believability of such
agents. Our rule-based implementation of this coordination process produces
coherent and useful sequences of utterances and behaviors that enhance the
educational capabilities of the learning environment. Animated agents that
utilize these believable sequences of behaviors are more personable and
flexible. Our implementation uses a diagnostic system to determine what type
of explanation the user requires and the ExplanAgent Planner to calculate the
specific behaviors the agent performs. The implementation allows for future
development through the addition of more problem cases, explanation types,
explanation templates, and utterance categories.
Animated agents offer a great opportunity to the world of educational
software. They still have many hurdles to overcome, such as believability and
adaptivity. Once they have matured, however, they will be a great asset to the
educational process. Eventually, knowledge-based learning environments
inhabited by animated pedagogical agents will become effective learning tools
for our society.