Past approaches for using reinforcement learning to derive dialog control policies have assumed that there was enough collected data to derive a reliable policy. In this paper we present a methodology for numerically constructing confidence intervals for the expected cumulative reward of a learned policy. These intervals are used to (1) better assess the reliability of the expected cumulative reward, and (2) perform a refined comparison between policies derived from different Markov Decision Process (MDP) models. We applied this methodology to a prior experiment where the goal was to select the best features to include in the MDP state space.
~~Our results show that while some of the policies developed in the prior work exhibited very large confidence intervals, the policy developed from the best feature set had a much smaller confidence interval and thus showed very high reliability.~~
NLP researchers frequently have to deal with issues of data sparsity. Whether the task is machine translation or named-entity recognition, the amount of data one has to train or test with can greatly impact the reliability and robustness of one's models, results, and conclusions. One research area that is particularly sensitive to the data sparsity issue is machine learning, specifically using Reinforcement Learning (RL) to learn the optimal action for a dialogue system to make given any user state.
Typically this involves learning from previously collected data or interacting in real time with real users or user simulators. One of the biggest advantages of this machine learning approach is that it can be used to generate optimal policies for every possible state. However, this method requires a thorough exploration of the state space to make reliable conclusions about what the best actions are. States that are infrequently visited in the training set could be assigned sub-optimal actions, and therefore the resulting dialogue manager may not provide the best interaction for the user.

In this work, we present an approach for estimating the reliability of a policy derived from collected training data. The key idea is to take into account the uncertainty in the model parameters (MDP transition probabilities), and use that information to numerically construct a confidence interval for the expected cumulative reward of the learned policy. This confidence interval approach allows us to: (1) better assess the reliability of the expected cumulative reward for a given policy, and (2) perform a refined comparison between policies derived from different MDP models.
~~We apply the proposed approach to our previous work (Tetreault and Litman, 2006) in using RL to improve a spoken dialogue tutoring system.~~
In that work, a dataset of 100 dialogues was used to develop a methodology for selecting which user state features should be included in the MDP state space. But are 100 dialogues enough to generate reliable policies? In this paper we apply our confidence interval approach to the same dataset in an effort to investigate how reliable our previous conclusions are, given the amount of available training data.
~~In the following section, we discuss the prior work and its data sparsity issue.~~
In section 3, we describe in detail our confidence interval methodology.
~~In section 4, we show how this methodology works by applying it to the prior work.~~
~~In sections 5 and 6, we present our conclusions and future work.~~
Past research into using RL to improve spoken dialogue systems has commonly used Markov Decision Processes (MDPs) (Sutton and Barto, 1998) to model a dialogue (such as (Levin and Pieraccini, 1997) and (Singh et al., 1999)).

An MDP is defined by a set of states $\{s_i\}_{i=1..n}$, a set of actions $\{a_k\}_{k=1..p}$, and a set of transition probabilities which reflect the dynamics of the environment, $\{p(s_i|s_j, a_k)\}_{i,j=1..n}^{k=1..p}$: if the model is at time $t$ in state $s_j$ and takes action $a_k$, then it will transition to state $s_i$ with probability $p(s_i|s_j, a_k)$. Additionally, an expected reward $r(s_i, s_j, a_k)$ is defined for each transition. Once these model parameters are known, a simple dynamic programming approach can be used to learn the optimal control policy $\pi^*$, i.e. the set of actions the model should take at each state to maximize its expected cumulative reward.
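To make the dynamic programming step concrete, the sketch below shows a standard value iteration routine for computing such a policy. It is a minimal illustration under our own assumptions, not the authors' implementation: the array layout, the discount factor gamma, and the convergence tolerance are all ours.

    import numpy as np

    def value_iteration(P, R, gamma=0.95, tol=1e-6):
        """Compute an optimal policy for an MDP via value iteration.

        P: transition probabilities, shape (p, n, n); P[k, j, i] = p(s_i | s_j, a_k)
        R: expected rewards, shape (p, n, n);        R[k, j, i] = r(s_i, s_j, a_k)
        Returns (policy, V): the best action index per state, and the state values.
        """
        p, n, _ = P.shape
        V = np.zeros(n)
        while True:
            # Q[k, j] = expected return of taking action a_k in state s_j
            Q = np.einsum('kji,kji->kj', P, R) + gamma * np.einsum('kji,i->kj', P, V)
            V_new = Q.max(axis=0)
            if np.max(np.abs(V_new - V)) < tol:
                return Q.argmax(axis=0), V_new
            V = V_new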
The dialog control problem can be naturally cast in this formalism: the states $\{s_i\}_{i=1..n}$ in the MDP correspond to the dialog states (or an abstraction thereof), the actions $\{a_k\}_{k=1..p}$ correspond to the particular actions the dialog manager might take, and the rewards $r(s_i, s_j, a_k)$ are defined to reflect a particular dialog performance metric. Once the MDP structure has been defined, the model parameters $\{p(s_i|s_j, a_k)\}_{i,j=1..n}^{k=1..p}$ are estimated from a corpus of dialogs (either real or simulated), and, based on them, the policy which maximizes the expected cumulative reward is computed.

While most work in this area has focused on developing the best policy (such as (Walker, 2000) and (Henderson et al., 2005)), there has been relatively little work done with respect to selecting the best features to include in the MDP state space.
For instance, Singh et al. (1999) showed that dialogue length was a useful state feature, and Frampton and Lemon (2005) showed that the user's last dialogue act was also useful. In our previous work, we compared the worth of several features. In addition, Paek and Chickering's (2005) work showed how a state space can be reduced by only selecting features that are relevant to maximizing the reward function. The motivation for this line of research is that if one can properly select the most informative features, one develops better policies, and thus a better dialogue system. In the following sections we summarize our past data, approach, results, and the issue of policy reliability.
~~2.1 MDP Structure.~~
~~For this study, we used an annotated corpus of human-computer spoken dialogue tutoring sessions.~~
The fixed-policy corpus contains data collected from 20 students interacting with the system over five problems (for a total of 100 dialogues of roughly 50 turns each). The corpus was annotated with 5 state features (Table 1). It should be noted that two of the features, Certainty and Frustration, were manually annotated, while the other three were annotated automatically. All features are binary except for Certainty, which has three values.
State               Values
Correctness         Student is correct or incorrect in the current turn
Certainty           Student is certain, neutral, or uncertain in the current turn
Concept Repetition  A particular concept is either new or repeated
Frustration         Student is frustrated or not in the current turn
Percent Correct     Student answers over 66% of questions correctly in the dialogue so far, or less

Table 1: State Features in Tutoring Corpus

For the action set $\{a_k\}_{k=1..p}$, we looked at what type of question the system could ask the student given the previous state.
There are a total of four possible actions: ask a short answer question (one that requires a simple one-word response), ask a complex answer question (one that requires a longer, deeper response), ask both a simple and a complex question in the same turn, or do not ask a question at all (give a hint). The reward function $r$ was the learning gain of each student, based on a pair of tests before and after the entire session of 5 dialogues. The 20 students were split into two groups (high and low learners) based on their learning gain, so 10 students and their respective five dialogues were given a positive reward of +100, while the remainder were assigned a negative reward of -100. The rewards were assigned in the final dialogue state, a common approach when applying RL in spoken dialogue systems.
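As a rough illustration of the scale of this state-action space, the snippet below enumerates the feature values from Table 1; the identifier names and encodings are ours, not the system's. With one ternary and four binary features, the full space has 2 x 3 x 2 x 2 x 2 = 48 states and 4 actions.

    from itertools import product

    # Hypothetical encoding of the five state features from Table 1.
    FEATURES = {
        "Correctness":        ["correct", "incorrect"],
        "Certainty":          ["certain", "neutral", "uncertain"],
        "Concept Repetition": ["new", "repeated"],
        "Frustration":        ["frustrated", "not frustrated"],
        "Percent Correct":    ["over 66%", "66% or less"],
    }
    ACTIONS = ["short_answer", "complex_answer", "both", "hint_only"]

    # Enumerate the full state space.
    states = list(product(*FEATURES.values()))
    print(len(states), "states x", len(ACTIONS), "actions")  # 48 states x 4 actions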
~~2.2 Approach and Results.~~
~~To investigate the usefulness of different features, we took the following approach.~~
~~We started with two baseline MDPs.~~
~~The first model (Baseline 1) used only the Correctness feature in the state-space.~~
The second model (Baseline 2) included both the Correctness and Certainty features. Next, we constructed 3 new models by adding each of the remaining three features (Frustration, Percent Correct, and Concept Repetition) to the Baseline 2 model.

We defined three metrics to compare the policies derived from these MDPs: (1) Diffs: the number of states whose policy differs from the Baseline 2 policy; (2) Percent Policy Change (P.C.): the weighted amount of change between the two policies (100% indicates total change); and (3) Expected Cumulative Reward (ECR): the average reward one would expect in that MDP when in the state space.

The intuition is that if a new feature were relevant, the corresponding model would lead to a different policy and a better expected cumulative reward (when compared to the baseline models). Conversely, if the features were not useful, one would expect that the new policies would look similar to the original baseline policy (specifically, the Diffs count and % Policy Change would be low) or produce similar expected cumulative rewards.
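A sketch of the first two metrics is given below. The Diffs count is unambiguous; for % Policy Change we assume the weighting is by how often each state occurs in the training data, since the precise weighting scheme is not restated here.

    def diffs(policy_a, policy_b):
        # Policies are dicts mapping state -> action (same key set assumed).
        # Diffs: number of states whose chosen action differs.
        return sum(1 for s in policy_a if policy_a[s] != policy_b[s])

    def percent_policy_change(policy_a, policy_b, state_counts):
        # Weighted amount of change between two policies (100% = total change).
        # Assumption: weights are the states' frequencies in the training data.
        total = sum(state_counts.values())
        changed = sum(c for s, c in state_counts.items()
                      if policy_a[s] != policy_b[s])
        return 100.0 * changed / total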
The results of this analysis are shown in Table 2.¹ The Diffs and Policy Change metrics are undefined for the two baselines, since we only use these two metrics to compare the other three features to Baseline 2.

¹ Please note that due to refinements in code, there is a slight difference between the ECRs reported in this work and the ECRs reported in the previous work, for the three features added to Baseline 2. These changes did not alter the rankings of these models, or the conclusions of the previous work.
All three metrics show that the best feature to add to the Baseline 2 model is Concept Repetition, since it results in the most change over the Baseline 2 policy and also yields the highest expected reward. For the remainder of this paper, when we refer to Concept Repetition, Frustration, or Percent Correctness, we are referring to the model that includes that feature as well as the Baseline 2 features (Correctness and Certainty).
State Feature               # Diffs   % P.C.   ECR
Baseline 1                  N/A       N/A      6.15
Baseline 2                  N/A       N/A      31.92
B2 + Concept Repetition     10        80.2%    42.56
B2 + Frustration            8         66.4%    32.99
B2 + Percent Correctness    4         44.3%    28.50

Table 2: Feature Comparison Results

2.3 Problem with Reliability.
However, the approach discussed above assumes that, given the size of the data set, the ECR and policies are reliable. If the MDP model were very fragile, that is, if the policy and expected cumulative reward were very sensitive to the quality of the transition probability estimates, then the metrics could reveal quite different rankings. Previously, we used a qualitative approach of tracking how the worth of each state (V-value) changed over time. The V-values indicate how much reward one would expect to accumulate from starting in that state until reaching a final state. We hypothesized that if the V-values stabilized as data increased, then the learned policy would be more reliable.

So is this V-value methodology adequate for assessing whether there is enough data to determine a stable policy, and for assessing whether one model is better than another? Since our approach for state-space selection is based on comparing a new policy with a baseline policy, having a stable policy is extremely important, since instability could lead to different conclusions. For example, in one comparison, a new policy could differ from the baseline in 8 out of 10 states. But if the MDP were unstable, adding just a little more data could result in a difference of only 4 out of 10 states. Is there an approach that can determine whether, given a certain data size, the expected cumulative reward (and thus the policy) is reliable? In the next section we present a new methodology for numerically constructing confidence intervals for these value function estimates.
~~Then, in the following section, we reevaluate our prior work with this methodology and discuss the results.~~
3.1 Policy Evaluation with Confidence Intervals.

The starting point for the proposed methodology is the observation that for each state $s_j$ and action $a_k$ in the MDP, the set of transition probabilities $\{p(s_i|s_j, a_k)\}_{i=1..n}$ is modeled as a multinomial distribution estimated from the transition counts in the training data:

$$\hat{p}(s_i|s_j, a_k) = \frac{c(s_i, s_j, a_k)}{\sum_{i=1}^{n} c(s_i, s_j, a_k)} \quad (1)$$

where $n$ is the number of states in the model, and $c(s_i, s_j, a_k)$ is the number of times the system was in state $s_j$, took action $a_k$, and transitioned to state $s_i$ in the training data.

It is important to note that these parameters are just estimates.
The reliability of these estimates clearly depends on the amount of training data, more specifically on the transition counts $c(s_i, s_j, a_k)$. For instance, consider a model with 3 states and 2 actions. Say the model was in state $s_1$ and took action $a_1$ ten times. Out of these, three times the model transitioned back to state $s_1$, two times it transitioned to state $s_2$, and five times to state $s_3$.
Then we have:

$$\hat{p}(s_i|s_1, a_1) = \langle 0.3, 0.2, 0.5 \rangle = \left\langle \frac{3}{10}, \frac{2}{10}, \frac{5}{10} \right\rangle \quad (2)$$

Additionally, let's say the same model was in state $s_2$ and took action $a_2$ 1000 times.
Following that action, it transitioned 300 times to state $s_1$, 200 times to state $s_2$, and 500 times to state $s_3$:

$$\hat{p}(s_i|s_2, a_2) = \langle 0.3, 0.2, 0.5 \rangle = \left\langle \frac{300}{1000}, \frac{200}{1000}, \frac{500}{1000} \right\rangle \quad (3)$$
While both sets of transition parameters have the same value, the second set of estimates is more reliable.
~~The central idea of the proposed approach is to model this uncertainty in the system parameters, and use it to numerically construct confidence intervals for the value of the optimal policy.~~
Formally, each set of transition probabilities $\{p(s_i|s_j, a_k)\}_{i=1..n}$ is modeled as a multinomial distribution, estimated from data.² The uncertainty of multinomial estimates is commonly modeled by means of a Dirichlet distribution. The Dirichlet distribution is characterized by a set of parameters $\alpha_1, \alpha_2, \ldots, \alpha_n$, which in this case correspond to the counts $\{c(s_i, s_j, a_k)\}_{i=1..n}$. For any given $j$, the likelihood of the set of multinomial transition parameters $\{p(s_i|s_j, a_k)\}_{i=1..n}$ is then given by:

$$P(\{p(s_i|s_j, a_k)\}_{i=1..n} \mid D) = \frac{1}{Z(D)} \prod_{i=1}^{n} p(s_i|s_j, a_k)^{\alpha_i - 1} \quad (4)$$

where $Z(D) = \frac{\prod_{i=1}^{n} \Gamma(\alpha_i)}{\Gamma(\sum_{i=1}^{n} \alpha_i)}$ and $\alpha_i = c(s_i, s_j, a_k)$.

Note that the maximum likelihood estimates for the formula above correspond to the frequency count formula we have already described:

$$\hat{p}_{ML}(s_i|s_j, a_k) = \frac{\alpha_i}{\sum_{i=1}^{n} \alpha_i} = \frac{c(s_i, s_j, a_k)}{\sum_{i=1}^{n} c(s_i, s_j, a_k)} \quad (5)$$

To capture the uncertainty in the model parameters, we therefore simply need to store the counts of the observed transitions $c(s_i, s_j, a_k)$.

² By $p$ we will denote the true model parameters; by $\hat{p}$ we will denote data-driven estimates for these parameters.
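To see how the Dirichlet model separates the two examples in equations (2) and (3), the snippet below samples transition vectors for the 10-count and 1000-count cases. Both share the point estimate <0.3, 0.2, 0.5>, but the sampled values scatter far more widely in the first case; the sample size and seed here are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)

    # Same point estimate <0.3, 0.2, 0.5>, very different amounts of evidence.
    for counts in ([3, 2, 5], [300, 200, 500]):
        samples = rng.dirichlet(counts, size=10000)
        lo, hi = np.percentile(samples[:, 0], [2.5, 97.5])
        print(f"counts={counts}: first component in [{lo:.2f}, {hi:.2f}] with prob. 0.95")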
Based on this model of uncertainty, we can numerically construct a confidence interval for the value of the optimal policy $\pi^*$. Instead of computing the value of the policy based on the maximum likelihood transition estimates $\hat{T}_{ML} = \{\hat{p}_{ML}(s_i|s_j, a_k)\}_{i,j=1..n}^{k=1..p}$, we generate a large number of transition matrices $\hat{T}_1, \hat{T}_2, \ldots, \hat{T}_m$ by sampling from the Dirichlet distributions corresponding to the counts observed in the training data (in the experiments reported in this paper, we used $m = 1000$).
We then compute the value of the optimal policy $\pi^*$ in each of these models, $\{V_{\pi^*}(\hat{T}_i)\}_{i=1..m}$. Finally, we numerically construct the 95% confidence interval for the value function based on the resulting value estimates: the bounds for the confidence interval are set at the lowest and highest 2.5 percentile of the resulting distribution of the values for the optimal policy $\{V_{\pi^*}(\hat{T}_i)\}_{i=1..m}$. The algorithm is outlined below:

1. Compute the transition counts from the training set:
$$C = \{c(s_i, s_j, a_k)\}_{i,j=1..n}^{k=1..p} \quad (6)$$

2. Compute the maximum likelihood estimates for the transition probability matrix:
$$\hat{T}_{ML} = \{\hat{p}_{ML}(s_i|s_j, a_k)\}_{i,j=1..n}^{k=1..p} \quad (7)$$

3. Use dynamic programming to compute the optimal policy $\pi^*$ for model $\hat{T}_{ML}$.

4. Sample $m$ transition matrices $\{\hat{T}_k\}_{k=1..m}$, using the Dirichlet distribution for each row:
$$\{\hat{p}(s_i|s_j, a_k)\}_{i=1..n} = \mathrm{Dir}(\{c(s_i, s_j, a_k)\}_{i=1..n}) \quad (8)$$

5. Evaluate the optimal policy $\pi^*$ in each of these $m$ models, obtaining $V_{\pi^*}(\hat{T}_i)$.

6. Numerically build the 95% confidence interval for $V_{\pi^*}$ from these estimates.

To summarize, the central idea is to take into account the reliability of the transition probability estimates and construct a confidence interval for the expected cumulative reward of the learned policy.
In the standard approach, we would compute an estimate for the expected cumulative reward by simply using the transition probabilities derived from the training set. Note that these transition probabilities are simply estimates, which are more or less accurate depending on how much data is available. The proposed methodology does not fully trust these estimates, and asks the question: given that the real world (i.e. the real transition probabilities) might actually be a bit different than we think it is, how well can we expect the learned policy to perform? Note that the confidence interval we construct, and therefore the conclusions we draw, are with respect to the policy learned from the current estimates, i.e. from the current training set. If more data becomes available, a different optimal policy might emerge, about which we cannot say much.
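The sketch below mirrors steps 4-6 for a policy already computed from the maximum likelihood estimates (e.g., with a value iteration routine like the sketch shown earlier). It is our reconstruction, not the authors' code: the discount factor, the use of a start-state distribution to reduce the value function to a single ECR number, and the epsilon added to empty count rows are all assumptions.

    import numpy as np

    def evaluate_policy(P, R, policy, gamma=0.95, tol=1e-6):
        """Value of a FIXED policy under transition model P (policy evaluation)."""
        n = P.shape[1]
        rows = np.arange(n)
        Pi = P[policy, rows]                        # Pi[j, i] = p(s_i | s_j, pi(s_j))
        r = np.einsum('ji,ji->j', Pi, R[policy, rows])
        V = np.zeros(n)
        while True:
            V_new = r + gamma * Pi @ V
            if np.max(np.abs(V_new - V)) < tol:
                return V_new
            V = V_new

    def ecr_confidence_interval(counts, R, policy, start_dist, m=1000, seed=0):
        """Steps 4-6: sample m transition matrices row-wise from Dirichlet
        distributions over the observed counts, evaluate the learned policy
        in each, and take the 2.5 and 97.5 percentiles of the results.

        counts[k, j, i] = c(s_i, s_j, a_k); start_dist turns V into a scalar ECR.
        """
        rng = np.random.default_rng(seed)
        p, n, _ = counts.shape
        values = np.empty(m)
        for t in range(m):
            T = np.empty((p, n, n))
            for k in range(p):
                for j in range(n):
                    # Small epsilon keeps rows with zero counts samplable.
                    T[k, j] = rng.dirichlet(counts[k, j] + 1e-3)
            values[t] = start_dist @ evaluate_policy(T, R, policy)
        return np.percentile(values, [2.5, 97.5])

In the paper's setting, rewards arrive only in final dialogue states, so an undiscounted formulation with absorbing final states would also be natural; discounting is used here only to guarantee the sketch converges.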
~~3.2 Related Work.~~
Given the stochastic nature of the models, confidence intervals are often used to estimate the reliability of results in machine learning experiments, e.g. (Rivals and Personnaz, 2002), (Schapire, 2002), and (Dumais et al., 1998).
~~In this work we use a confidence interval methodology in the context of MDPs.~~
The idea of modeling the uncertainty of the transition probability estimates using Dirichlet models also appears in (Jaulmes et al., 2005). In that work, the authors used the uncertainty in model parameters to develop active learning strategies for partially observable MDPs, a topic not previously addressed in the literature. In our work we rely on the same model of uncertainty for the transition matrix, but use it to derive confidence intervals for the expected cumulative reward of the learned optimal policy, in an effort to assess the reliability of this policy.
Our previous results indicated that Concept Repetition was the best feature to add to the Baseline 2 state-space model, but also that Percent Correctness and Frustration (when added to Baseline 2) offered an improvement over the baseline MDPs. However, these conclusions were based on a very qualitative approach for determining whether a policy is reliable. In the following subsection, we apply our confidence interval approach to empirically determine whether, given this data set of 100 dialogues, the estimates of the ECR are reliable, and whether the original rankings and conclusions hold up under this refined analysis.
~~In subsection 4.2, we provide a methodology for pinpointing when one model is better than another.~~
~~4.1 Quantitative Analysis of ECR Reliability.~~
For our first investigation, we look at the confidence intervals of each MDP's ECR over the entire data set of 20 students (later in this section we show plots of the confidence intervals as data increases). Table 3 shows the upper and lower bounds for the ECRs originally reported in Table 2. The first column shows the original, estimated ECR of the MDP, and the last column is the width of the bound (the difference between the upper and lower bounds).

So what conclusions can we make about the reliability of the ECR, and hence of the learned policies for the different MDPs, given this amount of training data?
State Feature               ECR     Lower Bound   Upper Bound   Width
Baseline 1                  6.15    0.21          23.73         23.52
Baseline 2 (B2)             31.92   -5.31         60.48         65.79
B2 + Concept Repetition     42.56   28.37         59.29         30.92
B2 + Frustration            32.99   -4.12         61.30         65.42
B2 + Percent Correctness    28.50   -5.89         57.82         63.71

Table 3: Confidence Intervals with the complete dataset

The confidence interval for the ECR of the Baseline 1 model ranges from 0.21 to 23.73.
Recall that the final-state rewards are capped at +100 and -100, and these are thus the maximum and minimum bounds that one can see in this experiment.
~~These bounds tell us that, if we take into account the uncertainty in the model estimates (given the small training set size), with probability 0.95 the actual true ECR for this policy will be greater than 0.21 and smaller than 23.73.~~
~~The width of this confidence interval is 23.52.~~
~~For the Baseline 2 model, the bounds are much wider: from -5.31 to 60.48, for a total width of 65.79.~~
~~While the ECR estimate is 31.92 (which is seemingly larger than 6.15 for the Baseline 1 model), the wide confidence interval tells us that this estimate is not very reliable.~~
It is possible that the policy derived from this model with this amount of data could perform poorly, and even receive a negative reward. From the dialogue system designer's standpoint, a model like this is best avoided.

Of the remaining three models (Concept Repetition, Frustration, and Percent Correctness), the first exhibits a tighter confidence interval, indicating that the estimated expected cumulative reward (42.56) is fairly reliable: with 95% probability it lies between 28.37 and 59.29. The ECR for the other two models (Frustration and Percent Correctness) again shows a wide confidence interval once we take into account the uncertainty in the model parameters.

These results shed more light on the shortcomings of the ECR metric used to evaluate the models in prior work.
This estimate does not take into account the uncertainty of the model parameters. For example, a model can have an optimal policy with a very high ECR value, but have very wide confidence bounds reaching even into negative rewards. On the other hand, another model can have a relatively lower ECR, but if its bounds are tighter (and the lower bound is not negative), one can know that its policy is less affected by poor parameter estimates stemming from data sparsity issues. Using the confidence intervals associated with the ECR gives a much more refined, quantitative estimate of the reliability of the reward, and hence of the policy derived from that data.

An extension of this result is that confidence intervals also allow us to make refined judgments about the comparative utility of different features, the original motivation of our prior study.
Basically, a model (M1) is better than another (M2) if M1's lower bound is greater than M2's upper bound. That is, one knows that, with 95% probability, the worst case for M1 (its lower bound) will still yield a higher reward than the best case for M2 (its upper bound). In our data, this happens only once: Concept Repetition is empirically better than Baseline 1, since the lower bound of Concept Repetition is 28.37 and the upper bound of Baseline 1 is 23.73.
~~Given this situation, Concept Repetition is a useful feature which, when included in the model, leads to a better policy than simply using Correctness.~~
We cannot draw any conclusions about the other features, since their bounds are generally quite wide. Given this amount of training data, we cannot say whether Percent Correctness and Frustration are better features than the baseline MDPs. Although their ECRs are higher, there is too much uncertainty to definitively conclude that they are better.
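This dominance test is simple enough to state in a few lines; the tuple representation of an interval is our own convention.

    def strictly_better(interval_a, interval_b):
        # Model A beats model B when A's lower bound exceeds B's upper bound.
        return interval_a[0] > interval_b[1]

    # Values from Table 3:
    print(strictly_better((28.37, 59.29), (0.21, 23.73)))  # Concept Rep. vs. B1: True
    print(strictly_better((-4.12, 61.30), (0.21, 23.73)))  # Frustration vs. B1: False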
~~4.2 Pinpointing Model Cross-over.~~
~~The previous analysis focused on a quantitative method of (1) determining the reliability of the MDP ECR estimate and policy, as well as (2) assessing whether one model is better than another.~~
In this section, we present an extension to the second contribution by answering the question: given that one model is more reliable than another, is it possible to determine the point at which one model's estimates become more reliable than the other's? In our case, we want to know at what point Concept Repetition becomes more reliable than Baseline 1.

[Figure 1: Confidence Interval Plots. Two panels, one for Baseline 1 and one for Baseline 2 + Concept Repetition, each plotting the confidence bounds and the calculated ECR against the number of students (0-20).]
To do this, we investigate how the confidence interval changes as the amount of training data increases, instead of looking at the reliability estimate at only one particular data size.

We incrementally increase the amount of training data (adding the data from one new student at a time), and calculate the corresponding optimal policy and confidence interval for the expected cumulative reward of that policy. Figure 1 shows the confidence interval plots as data is added to the MDP, for the Baseline 1 and Concept Repetition MDPs. For reference, the Baseline 2, Percent Correctness, and Frustration plots did not exhibit the same converging behavior as these two, which is not surprising given how wide their final bounds are. For each plot, the bold lines represent the upper and lower bounds, and the dotted line represents the calculated ECR.

Analyzing the two MDPs, we find that the confidence intervals for Baseline 1 and Concept Repetition converge as more data is added, which is an expected trend. One useful result of observing the change in confidence intervals is that one can determine the point at which one model becomes empirically better than another.
~~Superimposing the upper and lower bounds (Figure 2) reveals that after we include the data from the first 13 students, the lower bound of Concept Repetition crosses over the upper bound of Baseline 1.~~
~~Observing this behavior is especially useful for performing model switching.~~
In automatic model switching, a dialogue manager runs in real time and, as it collects data, can switch from using a simple dialogue model to a more complex one. Confidence intervals can be used to determine when to switch from one model to the next, by checking whether the complex model's bounds cross over the bounds of the current model. Basically, the dialogue manager switches when it can be sure that the more complex model's ECR is not only higher, but statistically significantly so (a sketch of this check follows Figure 2).
[Figure 2: Baseline 1 and Concept Repetition Bounds. The upper and lower ECR bounds of the two models superimposed, plotted against the number of students (0-20).]
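Below is a minimal sketch of that switching check. It assumes the confidence bounds are recomputed after each increment of training data (here, after each student added); the function and data structures are hypothetical.

    def switch_point(bounds_simple, bounds_complex):
        """First data size at which the complex model's lower bound crosses
        above the simple model's upper bound; None if it never does.

        Each argument is a list of (lower, upper) ECR bounds, one entry per
        training increment (e.g., per student added).
        """
        for t, ((_, upper_s), (lower_c, _)) in enumerate(
                zip(bounds_simple, bounds_complex), start=1):
            if lower_c > upper_s:
                return t  # e.g., 13 students for Baseline 1 vs. Concept Repetition
        return None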
Past work in using MDPs to improve spoken dialogue systems has usually glossed over the issue of whether or not there was enough training data to develop reliable policies. In this work, we present a numerical method for building confidence intervals for the expected cumulative reward of a learned policy. The proposed approach allows one to (1) better assess the reliability of the expected cumulative reward for a given policy, and (2) perform a refined comparison between policies derived from different MDP models.

We applied this methodology to a prior experiment where the objective was to select the best features to include in the MDP state space.
~~Our results show that policies constructed from the Baseline 1 and Concept Repetition models are more reliable, given the amount of data available for training.~~
~~The Concept Repetition model (which is composed of the Concept Repetition, Certainty and Correctness features) was especially useful, as it led to a policy that outperformed the Baseline 1 model, even when we take into account the uncertainty in the model estimates caused by data sparsity.~~
In contrast, for the Baseline 2, Percent Correctness, and Frustration models, the estimates of the expected cumulative reward are much less reliable, and no conclusion can be reliably drawn about the usefulness of these features.
~~In addition, we showed that our confidence interval approach has applications in another MDP problem: model switching.~~
As an extension of this work, we are currently investigating in more detail what makes some MDPs reliable or unreliable for a certain data size (such as the case where Baseline 2 does not converge but a more complicated model, such as Concept Repetition, does). Our initial findings indicate that, as more data becomes available, the bounds tighten for most parameters in the transition matrix.
~~However, for some of the parameters the bounds can remain wide, and that is enough to keep the confidence interval for the expected cumulative reward from converging.~~
Acknowledgments

We would like to thank Jeff Schneider, Drew Bagnell, Pam Jordan, the ITSPOKE and Pitt NLP groups, and the Dialog on Dialogs group for their help and comments. Finally, we would like to thank the four anonymous reviewers for their comments on the initial version of this paper.
~~Support for this research was provided by NSF grants #0325054 and #0328431.~~