We present a method for unsupervised topic modelling which adapts methods used in document classification (Blei et al., 2003; Griffiths and Steyvers, 2004) to unsegmented multi-party discourse transcripts. We show how Bayesian inference in this generative model can be used to simultaneously address the problems of topic segmentation and topic identification: automatically segmenting multi-party meetings into topically coherent segments with performance which compares well with previous unsupervised segmentation-only methods (Galley et al., 2003), while simultaneously extracting topics which rate highly when assessed for coherence by human judges. We also show that this method appears robust in the face of off-topic dialogue and speech recognition errors.
Topic segmentation (the division of a text or discourse into topically coherent segments) and topic identification (the classification of those segments by subject matter) are joint problems. Both are necessary steps in automatic indexing, retrieval and summarization from large datasets, whether spoken or written. Both have received significant attention in the past (see Section 2), but most approaches have been targeted at either text or monologue, and most address only one of the two issues (usually for the very good reason that the dataset itself provides the other, for example by the explicit separation of individual documents or news stories in a collection). Spoken multi-party meetings pose a difficult problem: firstly, neither the segmentation nor the discussed topics can be taken as given; secondly, the discourse is by nature less tidily structured and less restricted in domain; and thirdly, speech recognition results have unavoidably high levels of error due to the noisy multi-speaker environment.

In this paper we present a method for unsupervised topic modelling which allows us to approach both problems simultaneously, inferring a set of topics while providing a segmentation into topically coherent segments. We show that this model can address these problems over multi-party discourse transcripts, providing good segmentation performance on a corpus of meetings (comparable to the best previous unsupervised method that we are aware of (Galley et al., 2003)), while also inferring a set of topics rated as semantically coherent by human judges. We then show that its segmentation performance appears relatively robust to speech recognition errors, giving us confidence that it can be successfully applied in a real speech-processing system.

The plan of the paper is as follows. Section 2 below briefly discusses previous approaches to the identification and segmentation problems. Section 3 then describes the model we use here. Section 4 then details our experiments and results, and conclusions are drawn in Section 5.
In this paper we are interested in spoken discourse, and in particular multi-party human-human meetings. Our overall aim is to produce information which can be used to summarize, browse and/or retrieve the information contained in meetings. User studies (Lisowska et al., 2004; Banerjee et al., 2005) have shown that topic information is important here: people are likely to want to know which topics were discussed in a particular meeting, as well as have access to the discussion on particular topics in which they are interested. Of course, this requires both identification of the topics discussed, and segmentation into the periods of topically related discussion.

Work on automatic topic segmentation of text and monologue has been prolific, with a variety of approaches used. (Hearst, 1994) uses a measure of lexical cohesion between adjoining paragraphs in text; (Reynar, 1999) and (Beeferman et al., 1999) combine a variety of features such as statistical language modelling, cue phrases, discourse information and the presence of pronouns or named entities to segment broadcast news; (Maskey and Hirschberg, 2003) use entirely non-lexical features. Recent advances have used generative models, allowing lexical models of the topics themselves to be built while segmenting (Imai et al., 1997; Barzilay and Lee, 2004), and we take a similar approach here, although with some important differences detailed below.

Turning to multi-party discourse and meetings, however, most previous work on automatic segmentation (Reiter and Rigoll, 2004; Dielmann and Renals, 2004; Banerjee and Rudnicky, 2004) treats segments as representing meeting phases or events which characterize the type or style of discourse taking place (presentation, briefing, discussion etc.), rather than the topic or subject matter. While we expect some correlation between these two types of segmentation, they are clearly different problems. However, one comparable study is described in (Galley et al., 2003). Here, a lexical cohesion approach was used to develop an essentially unsupervised segmentation tool (LCSeg) which was applied to both text and meeting transcripts, giving performance better than that achieved by applying text/monologue-based techniques (see Section 4 below), and we take this as our benchmark for the segmentation problem. Note that they improved their accuracy by combining the unsupervised output with discourse features in a supervised classifier; while we do not attempt a similar comparison here, we expect a similar technique would yield similar segmentation improvements.
In contrast, we take a generative approach, modelling the text as being generated by a sequence of mixtures of underlying topics. The approach is unsupervised, allowing both segmentation and topic extraction from unlabelled data.

We specify our model to address the problem of topic segmentation: attempting to break the discourse into discrete segments in which a particular set of topics is discussed. Assume we have a corpus of $U$ utterances, ordered in sequence. The $u$th utterance consists of $N_u$ words, chosen from a vocabulary of size $W$. The set of words associated with the $u$th utterance is denoted $\mathbf{w}_u$, and indexed as $w_{u,i}$; the entire corpus is represented by $\mathbf{w}$. Following previous work on probabilistic topic models (Hofmann, 1999; Blei et al., 2003; Griffiths and Steyvers, 2004), we model each utterance as being generated from a particular distribution over topics, where each topic is a probability distribution over words. The utterances are ordered sequentially, and we assume a Markov structure on the distribution over topics: with high probability, the distribution for utterance $u$ is the same as for utterance $u-1$; otherwise, we sample a new distribution over topics. This pattern of dependency is produced by associating a binary switching variable with each utterance, indicating whether its topic distribution is the same as that of the previous utterance. The joint states of all the switching variables define segments that should be semantically coherent, because their words are generated by the same topic vector. We will first describe this generative model in more detail, and then discuss inference in this model.
3.1 A hierarchical Bayesian model

We are interested in where changes occur in the set of topics discussed in these utterances. To this end, let $c_u$ indicate whether a change in the distribution over topics occurs at the $u$th utterance, and let $P(c_u = 1) = \pi$ (where $\pi$ thus defines the expected number of segments). The distribution over topics associated with the $u$th utterance will be denoted $\theta^{(u)}$, and is a multinomial distribution over $T$ topics, with the probability of topic $t$ being $\theta^{(u)}_t$. If $c_u = 0$, then $\theta^{(u)} = \theta^{(u-1)}$; otherwise, $\theta^{(u)}$ is drawn from a symmetric Dirichlet distribution with parameter $\alpha$.
The distribution is thus:

$$P(\theta^{(u)} \mid c_u, \theta^{(u-1)}) = \begin{cases} \delta(\theta^{(u)}, \theta^{(u-1)}) & c_u = 0 \\ \frac{\Gamma(T\alpha)}{\Gamma(\alpha)^T} \prod_{t=1}^{T} \left(\theta^{(u)}_t\right)^{\alpha-1} & c_u = 1 \end{cases}$$

where $\delta(\cdot, \cdot)$ is the Dirac delta function, and $\Gamma(\cdot)$ is the generalized factorial function. This distribution is not well-defined when $u = 1$, so we set $c_1 = 1$ and draw $\theta^{(1)}$ from a symmetric Dirichlet($\alpha$) distribution accordingly.

Figure 1: Graphical models indicating the dependencies among variables in (a) the topic segmentation model and (b) the hidden Markov model used as a comparison.

As in (Hofmann, 1999; Blei et al., 2003; Griffiths and Steyvers, 2004), each topic $j$ is a multinomial distribution $\phi^{(j)}$ over words, and the probability of the word $w$ under that topic is $\phi^{(j)}_w$. The $u$th utterance is generated by sampling a topic assignment $z_{u,i}$ for each word $i$ in that utterance with $P(z_{u,i} = t \mid \theta^{(u)}) = \theta^{(u)}_t$, and then sampling a word $w_{u,i}$ from $\phi^{(j)}$, with $P(w_{u,i} = w \mid z_{u,i} = j, \phi^{(j)}) = \phi^{(j)}_w$. If we assume that $\pi$ is generated from a symmetric Beta($\gamma$) distribution, and each $\phi^{(j)}$ is generated from a symmetric Dirichlet($\beta$) distribution, we obtain a joint distribution over all of these variables with the dependency structure shown in Figure 1A.
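As a concrete illustration, the generative process just described can be sketched in a few lines of code. This is our own sketch, not the implementation used in the experiments; all function names are ours, and the hyperparameter values at the bottom are arbitrary illustrative choices:

```python
import random

def dirichlet(alpha, k, rng):
    """Sample from a symmetric Dirichlet(alpha) over k outcomes."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    s = sum(g)
    return [x / s for x in g]

def categorical(p, rng):
    """Draw an index with probabilities p."""
    r, acc = rng.random(), 0.0
    for i, p_i in enumerate(p):
        acc += p_i
        if r < acc:
            return i
    return len(p) - 1

def generate_corpus(U, N_u, T, W, alpha, beta, gamma, rng):
    """Sample utterances and change indicators c from the model."""
    phi = [dirichlet(beta, W, rng) for _ in range(T)]  # topic-word dists
    pi = rng.betavariate(gamma, gamma)                 # P(c_u = 1)
    utterances, c, theta = [], [], None
    for u in range(U):
        c_u = 1 if (u == 0 or rng.random() < pi) else 0
        if c_u == 1:                                   # new segment:
            theta = dirichlet(alpha, T, rng)           # fresh topic mixture
        c.append(c_u)
        words = []
        for _ in range(N_u):
            z = categorical(theta, rng)                # topic assignment
            words.append(categorical(phi[z], rng))     # word from that topic
        utterances.append(words)
    return utterances, c

rng = random.Random(0)
utts, c = generate_corpus(U=50, N_u=10, T=5, W=25,
                          alpha=0.1, beta=0.1, gamma=1.0, rng=rng)
```

Note how the switching variable $c_u$ either copies the previous topic mixture or draws a fresh one, so consecutive utterances with $c_u = 0$ share the same distribution over topics.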
3.2 Inference

Assessing the posterior probability distribution over topic changes $\mathbf{c}$ given a corpus $\mathbf{w}$ can be simplified by integrating out the parameters $\theta$, $\phi$, and $\pi$. According to Bayes' rule we have:

$$P(\mathbf{z}, \mathbf{c} \mid \mathbf{w}) = \frac{P(\mathbf{w} \mid \mathbf{z})\, P(\mathbf{z} \mid \mathbf{c})\, P(\mathbf{c})}{\sum_{\mathbf{z}, \mathbf{c}} P(\mathbf{w} \mid \mathbf{z})\, P(\mathbf{z} \mid \mathbf{c})\, P(\mathbf{c})} \quad (1)$$

Evaluating $P(\mathbf{c})$ requires integrating over $\pi$.
Specifically, we have:

$$P(\mathbf{c}) = \int_0^1 P(\mathbf{c} \mid \pi)\, P(\pi)\, d\pi = \frac{\Gamma(2\gamma)}{\Gamma(\gamma)^2} \cdot \frac{\Gamma(n_1 + \gamma)\, \Gamma(n_0 + \gamma)}{\Gamma(N + 2\gamma)} \quad (2)$$

where $n_1$ is the number of utterances for which $c_u = 1$, and $n_0$ is the number of utterances for which $c_u = 0$.
Computing $P(\mathbf{w} \mid \mathbf{z})$ proceeds along similar lines:

$$P(\mathbf{w} \mid \mathbf{z}) = \int_{\Delta_W^T} P(\mathbf{w} \mid \mathbf{z}, \phi)\, P(\phi)\, d\phi = \left(\frac{\Gamma(W\beta)}{\Gamma(\beta)^W}\right)^{T} \prod_{t=1}^{T} \frac{\prod_{w=1}^{W} \Gamma(n^{(t)}_w + \beta)}{\Gamma(n^{(t)}_\cdot + W\beta)} \quad (3)$$

where $\Delta_W^T$ is the $T$-dimensional cross-product of the multinomial simplex on $W$ points, $n^{(t)}_w$ is the number of times word $w$ is assigned to topic $t$ in $\mathbf{z}$, and $n^{(t)}_\cdot$ is the total number of words assigned to topic $t$ in $\mathbf{z}$. To evaluate $P(\mathbf{z} \mid \mathbf{c})$ we have:

$$P(\mathbf{z} \mid \mathbf{c}) = \int_{\Delta_T^U} P(\mathbf{z} \mid \theta)\, P(\theta \mid \mathbf{c})\, d\theta \quad (4)$$
The fact that the $c_u$ variables effectively divide the sequence of utterances into segments that use the same distribution over topics simplifies solving the integral, and we obtain:

$$P(\mathbf{z} \mid \mathbf{c}) = \left(\frac{\Gamma(T\alpha)}{\Gamma(\alpha)^T}\right)^{n_1} \prod_{u \in U_1} \frac{\prod_{t=1}^{T} \Gamma(n^{(S_u)}_t + \alpha)}{\Gamma(n^{(S_u)}_\cdot + T\alpha)} \quad (5)$$

where $U_1 = \{u \mid c_u = 1\}$, $U_0 = \{u \mid c_u = 0\}$, $S_u$ denotes the set of utterances that share the same topic distribution as $u$ (i.e. belong to the same segment), and $n^{(S_u)}_t$ is the number of times topic $t$ appears in the segment $S_u$ (i.e. in the values of $z_{u'}$ for $u' \in S_u$). The conditional distribution over $c_u$ used by the Gibbs sampler described below is:

$$P(c_u \mid \mathbf{c}_{-u}, \mathbf{z}, \mathbf{w}) \propto \begin{cases} \dfrac{\prod_{t=1}^{T} \Gamma(n^{(S^0_u)}_t + \alpha)}{\Gamma(n^{(S^0_u)}_\cdot + T\alpha)} \cdot \dfrac{n_0 + \gamma}{N + 2\gamma} & c_u = 0 \\[2ex] \dfrac{\Gamma(T\alpha)}{\Gamma(\alpha)^T} \cdot \dfrac{\prod_{t=1}^{T} \Gamma(n^{(S^1_{u-1})}_t + \alpha)}{\Gamma(n^{(S^1_{u-1})}_\cdot + T\alpha)} \cdot \dfrac{\prod_{t=1}^{T} \Gamma(n^{(S^1_u)}_t + \alpha)}{\Gamma(n^{(S^1_u)}_\cdot + T\alpha)} \cdot \dfrac{n_1 + \gamma}{N + 2\gamma} & c_u = 1 \end{cases} \quad (7)$$
Equations 2, 3, and 5 allow us to evaluate the numerator of the expression in Equation 1. However, computing the denominator is intractable. Consequently, we sample from the posterior distribution $P(\mathbf{z}, \mathbf{c} \mid \mathbf{w})$ using Markov chain Monte Carlo (MCMC) (Gilks et al., 1996). We use Gibbs sampling, drawing the topic assignment for each word, $z_{u,i}$, conditioned on all other topic assignments, $\mathbf{z}_{-(u,i)}$, all topic change indicators, $\mathbf{c}$, and all words, $\mathbf{w}$; and then drawing the topic change indicator for each utterance, $c_u$, conditioned on all other topic change indicators, $\mathbf{c}_{-u}$, all topic assignments $\mathbf{z}$, and all words $\mathbf{w}$.

The conditional probabilities we need can be derived directly from Equations 2, 3, and 5. The conditional probability of $z_{u,i}$ indicates the probability that $w_{u,i}$ should be assigned to a particular topic, given other assignments, the current segmentation, and the words in the utterances. Cancelling constant terms, we obtain:

$$P(z_{u,i} = t \mid \mathbf{z}_{-(u,i)}, \mathbf{c}, \mathbf{w}) \propto \frac{n^{(t)}_{w_{u,i}} + \beta}{n^{(t)}_\cdot + W\beta} \cdot \frac{n^{(S_u)}_t + \alpha}{n^{(S_u)}_\cdot + T\alpha} \quad (6)$$

where all counts (i.e. the $n$ terms) exclude $z_{u,i}$. The conditional probability of $c_u$ indicates the probability that a new segment should start at $u$; in sampling $c_u$ from this distribution, we are splitting or merging segments. Similarly we obtain the expression in Equation 7, where $S^1_u$ is $S_u$ for the segmentation when $c_u = 1$, $S^0_u$ is $S_u$ for the segmentation when $c_u = 0$, and all counts (e.g. $n_1$) exclude $c_u$.
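A single collapsed Gibbs update for one $z_{u,i}$, following Equation 6, can be sketched as below. This is our own illustration; the count arrays and their bookkeeping are our own naming conventions, not the authors' implementation:

```python
import random

def sample_z(u, i, w, z, seg_id, n_tw, n_t, n_st, n_s, T, W, alpha, beta, rng):
    """Resample one topic assignment z[u][i] from Equation 6.
    n_tw[t][v]: count of word v in topic t;  n_t[t]: words in topic t;
    n_st[s][t]: count of topic t in segment s;  n_s[s]: words in segment s;
    seg_id[u]: the segment each utterance currently belongs to."""
    v, t_old, s = w[u][i], z[u][i], seg_id[u]
    # Remove the current assignment from all counts.
    n_tw[t_old][v] -= 1; n_t[t_old] -= 1
    n_st[s][t_old] -= 1; n_s[s] -= 1
    # Unnormalized conditional for each candidate topic (Equation 6).
    weights = [(n_tw[t][v] + beta) / (n_t[t] + W * beta)
               * (n_st[s][t] + alpha) / (n_s[s] + T * alpha)
               for t in range(T)]
    t_new = rng.choices(range(T), weights=weights)[0]
    # Add the new assignment back into the counts.
    z[u][i] = t_new
    n_tw[t_new][v] += 1; n_t[t_new] += 1
    n_st[s][t_new] += 1; n_s[s] += 1
    return t_new
```

The second denominator in the weights is constant across topics and could be dropped; it is kept here to mirror Equation 6 term by term. The $c_u$ update (Equation 7) follows the same remove-score-resample pattern, but rescores whole segments rather than single words.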
For this paper, we fixed $\alpha$, $\beta$ and $\gamma$ at 0.01.
Our algorithm is related to (Barzilay and Lee, 2004)'s approach to text segmentation, which uses a hidden Markov model (HMM) to model segmentation and topic inference for text using a bigram representation in restricted domains. Due to the adaptive combination of different topics, our algorithm can be expected to generalize well to larger domains. It also relates to earlier work by (Blei and Moreno, 2001) that uses a topic representation but also does not allow adaptively combining different topics. However, while HMM approaches allow a segmentation of the data by topic, they do not allow adaptively combining different topics into segments: while a new segment can be modelled as being identical to a topic that has already been observed, it cannot be modelled as a combination of the previously observed topics.¹ Note that while (Imai et al., 1997)'s HMM approach allows topic mixtures, it requires supervision with hand-labelled topics.
In our experiments we therefore compared our results with those obtained by a similar but simpler 10-state HMM, using a similar Gibbs sampling algorithm. The key difference between the two models is shown in Figure 1. In the HMM, all variation in the content of utterances is modelled at a single level, with each segment having a distribution over words corresponding to a single state. The hierarchical structure of our topic segmentation model allows variation in content to be expressed at two levels, with each segment being produced from a linear combination of the distributions associated with each topic. Consequently, our model can often capture the content of a sequence of words by postulating a single segment with a novel distribution over topics, while the HMM has to frequently switch between states.
4.1 Experiment 0: Simulated data
To analyze the properties of this algorithm we first applied it to a simulated dataset: a sequence of 10,000 words chosen from a vocabulary of 25. Each segment of 100 successive words had a constant topic distribution (with distributions for different segments drawn from a Dirichlet distribution with $\alpha = 0.1$), and each subsequence of 10 words was taken to be one utterance.

Figure 2: Simulated data: A) inferred topics; B) segmentation probabilities; C) HMM version.

¹ Say that a particular corpus leads us to infer topics corresponding to "speech recognition" and "discourse understanding". A single discussion concerning speech recognition for discourse understanding could be modelled by our algorithm as a single segment with a suitable weighted mixture of the two topics; an HMM approach would tend to split it into multiple segments (or require a specific topic for this segment).
The topic-word assignments were chosen such that, when the vocabulary is aligned in a 5×5 grid, the topics were binary bars. The inference algorithm was then run for 200,000 iterations, with samples collected after every 1,000 iterations to minimize autocorrelation. Figure 2 shows the inferred topic-word distributions and segment boundaries, which correspond well with those used to generate the data.
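The "bars" setup can be reproduced along the following lines. This is our reconstruction of the description above; the original's exact sampling details are not specified, so treat the code as illustrative:

```python
import random

GRID = 5                # vocabulary of 25 words arranged in a 5x5 grid
T = 2 * GRID            # one "bar" topic per row plus one per column

def bar_topic(t):
    """Words of topic t: row t (for t < GRID) or column t - GRID."""
    if t < GRID:
        return [t * GRID + j for j in range(GRID)]           # a row
    return [j * GRID + (t - GRID) for j in range(GRID)]      # a column

def generate(rng, n_words=10000, seg_len=100, utt_len=10, alpha=0.1):
    """10,000 words; each 100-word segment gets its own topic mixture
    drawn from a Dirichlet with parameter alpha, split into
    10-word utterances."""
    utterances = []
    for _ in range(n_words // seg_len):
        g = [rng.gammavariate(alpha, 1.0) for _ in range(T)]
        s = sum(g)
        theta = [x / s for x in g]                  # segment topic mixture
        words = []
        for _ in range(seg_len):
            t = rng.choices(range(T), weights=theta)[0]
            words.append(rng.choice(bar_topic(t)))  # uniform within the bar
        utterances += [words[i:i + utt_len]
                       for i in range(0, seg_len, utt_len)]
    return utterances

rng = random.Random(1)
utts = generate(rng)
```

Because each topic is uniform over one row or column of the grid, the recovered topic-word matrices can be inspected visually as horizontal and vertical bars, which is what Figure 2A depicts.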
4.2 Experiment 1: The ICSI corpus
We applied the algorithm to the ICSI meeting corpus transcripts (Janin et al., 2003), consisting of manual transcriptions of 75 meetings. For evaluation, we use (Galley et al., 2003)'s set of human-annotated segmentations, which covers a sub-portion of 25 meetings and takes a relatively coarse-grained approach to topic, with an average of 5-6 topic segments per meeting. Note that these segmentations were not used in training the model: topic inference and segmentation were unsupervised, with the human annotations used only to provide some knowledge of the overall segmentation density and to evaluate performance.

The transcripts from all 75 meetings were linearized by utterance start time and merged into a single dataset that contained 607,263 word tokens. We sampled for 200,000 iterations of MCMC, taking samples every 1,000 iterations, and then averaged the sampled $c_u$ variables over the last 100 samples to derive an estimate for the posterior probability of a segmentation boundary at each utterance start.
This probability was then thresholded to derive a final segmentation, which was compared to the manual annotations. More precisely, we apply a small amount of smoothing (Gaussian kernel convolution) and take the midpoints of any areas above a set threshold to be the segment boundaries. Varying this threshold allows us to segment the discourse in a more or less fine-grained way (and we anticipate that this could be user-settable in a meeting browsing application). If the correct number of segments is known for a meeting, this can be used directly to determine the optimum threshold, increasing performance; if not, we must set it at a level which corresponds to the desired general level of granularity.
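The smoothing-and-midpoint step can be sketched as follows. This is a minimal reimplementation; the kernel width and threshold here are illustrative choices, not the settings used in the experiments:

```python
import math

def smooth(p, sigma=2.0, radius=6):
    """Convolve per-utterance boundary probabilities with a
    truncated Gaussian kernel, clamping at the edges."""
    kernel = [math.exp(-0.5 * (k / sigma) ** 2)
              for k in range(-radius, radius + 1)]
    s = sum(kernel)
    kernel = [k / s for k in kernel]
    out = []
    for i in range(len(p)):
        acc = 0.0
        for k in range(-radius, radius + 1):
            j = min(max(i + k, 0), len(p) - 1)   # clamp at the edges
            acc += kernel[k + radius] * p[j]
        out.append(acc)
    return out

def boundaries(p, threshold):
    """Midpoints of contiguous regions where the smoothed
    probability exceeds the threshold."""
    q = smooth(p)
    bounds, start = [], None
    for i, v in enumerate(q):
        if v > threshold and start is None:
            start = i
        elif v <= threshold and start is not None:
            bounds.append((start + i - 1) // 2)
            start = None
    if start is not None:
        bounds.append((start + len(q) - 1) // 2)
    return bounds
```

Raising the threshold yields fewer, coarser segments; lowering it yields more, which is the granularity knob described above.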
For each set of annotations, we therefore performed two sets of segmentations: one in which the threshold was set for each meeting to give the known gold-standard number of segments, and one in which the threshold was set on a separate development set to give the overall corpus-wide average number of segments, and held constant for all test meetings.² This also allows us to compare our results with those of (Galley et al., 2003), who apply a similar threshold to their lexical cohesion function and give corresponding results produced with known/unknown numbers of segments.

Segmentation. We assessed segmentation performance using the $P_k$ and WindowDiff (WD) error measures proposed by (Beeferman et al., 1999) and (Pevzner and Hearst, 2002) respectively; both intuitively provide a measure of the probability that two points drawn from the meeting will be incorrectly separated by a hypothesized segment boundary, so lower $P_k$ and WD figures indicate better agreement with the human-annotated results.³ For the numbers of segments we are dealing with, a baseline of segmenting the discourse into equal-length segments gives both $P_k$ and WD of about 50%.
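The $P_k$ measure itself is simple to compute. A standard formulation is sketched below (our own code, using the conventional window size of half the mean reference segment length; `ref` and `hyp` are per-position segment ids):

```python
def pk(ref, hyp, k=None):
    """P_k for two segmentations given as per-position segment ids
    (ids must be distinct across segments). Over a sliding window of
    width k, count how often reference and hypothesis disagree on
    whether the two window ends lie in the same segment."""
    n = len(ref)
    if k is None:
        # Conventional choice: half the mean reference segment length.
        n_segs = len(set(ref))
        k = max(1, round(n / n_segs / 2))
    errors = 0
    total = n - k
    for i in range(total):
        same_ref = ref[i] == ref[i + k]
        same_hyp = hyp[i] == hyp[i + k]
        if same_ref != same_hyp:
            errors += 1
    return errors / total

# Identical segmentations incur no error:
print(pk([0] * 10 + [1] * 10, [0] * 10 + [1] * 10))  # → 0.0
```

WD is computed analogously, but compares the *number* of boundaries inside each window rather than the binary same-segment judgement, which is why it penalizes near-miss boundary clusters differently.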
In order to investigate the effect of the number of underlying topics $T$, we tested models using 2, 5, 10 and 20 topics. We then compared performance with (Galley et al., 2003)'s LCSeg tool, and with a 10-state HMM model as described above. Results are shown in Table 1, averaged over the 25 test meetings.

Results show that our model significantly outperforms the HMM equivalent: because the HMM cannot combine different topics, it places a lot of segmentation boundaries, resulting in inferior performance. Using stemming and a bigram representation, however, might improve its performance (Barzilay and Lee, 2004), although similar benefits might equally apply to our model.

² The development set was formed from the other meetings in the same ICSI subject areas as the annotated test meetings.

³ WD takes into account the likely number of incorrectly separating hypothesized boundaries; $P_k$ only a binary correct/incorrect classification.

Figure 3: Results from the ICSI corpus: A) the words most indicative for each topic; B) probability of a segment boundary, compared with human segmentation, for an arbitrary subset of the data; C) receiver operating characteristic (ROC) curves for predicting human segmentation, and conditional probabilities of placing a boundary at an offset from a human boundary; D) subjective topic coherence ratings.

Table 1: Results on the ICSI meeting corpus.

    Pk for varying numbers of topics T, versus the HMM and LCSeg:

        Model:  T=2    T=5    T=10   T=20   HMM    LCSeg
        Pk:     .284   .297   .329   .290   .375   .319

    Performance with known vs. unknown numbers of segments:

                    known           unknown
        Model       Pk     WD       Pk     WD
        T = 10      .289   .329     .329   .353
        LCSeg       .264   .294     .319   .359
It also performs comparably to (Galley et al., 2003)'s unsupervised performance (exceeding it for some settings of $T$). It does not perform as well as their hybrid supervised system, which combined LCSeg with supervised learning over discourse features ($P_k = .23$); but we expect that a similar approach would be possible here, combining our segmentation probabilities with other discourse-based features in a supervised way for improved performance. Interestingly, segmentation quality, at least at this relatively coarse-grained level, seems hardly affected by the overall number of topics $T$. Figure 3B shows an example for one meeting of how the inferred topic segmentation probabilities at each utterance compare with the gold-standard segment boundaries.
Figure 3C illustrates the performance difference between our model and the HMM equivalent at an example segment boundary: for this example, the HMM model gives almost no discrimination.

Identification. Figure 3A shows the most indicative words for a subset of the topics inferred at the last iteration. Encouragingly, most topics seem intuitively to reflect the subjects we know were discussed in the ICSI meetings: the majority of them (67 meetings) are taken from the weekly meetings of 3 distinct research groups, where discussions centered around speech recognition techniques (topics 2, 5), meeting recording, annotation and hardware setup (topics 6, 3, 1, 8), and robust language processing (topic 7). Others reflect general classes of words which are independent of subject matter (topic 4).
To compare the quality of these inferred topics, we performed an experiment in which 7 human observers rated (on a scale of 1 to 9) the semantic coherence of 50 lists of 10 words each. Of these lists, 40 contained the most indicative words for each of the 10 topics from different models: the topic segmentation model; a topic model that had the same number of segments but with fixed, evenly spread segmentation boundaries; an equivalent with randomly placed segmentation boundaries; and the HMM. The other 10 lists contained random samples of 10 words from the other 40 lists.

Results are shown in Figure 3D, with the topic segmentation model producing the most coherent topics and the HMM model and random words scoring less well. Interestingly, allowing the topic model to infer topics over an even distribution of boundaries performs similarly well, but over a random segmentation it performs badly: topic quality is thus not very susceptible to the precise segmentation of the text, but does require some reasonable approximation (on ICSI data, an even segmentation gives a $P_k$ of about 50%, while random segmentations can do much worse). However, note that the full topic segmentation model is able to identify meaningful segmentation boundaries at the same time as inferring topics.
4.3 Experiment 2: Dialogue robustness

Meetings often include off-topic dialogue, in particular at the beginning and end, where informal chat and meta-dialogue are common. Galley et al. (2003) annotated these sections explicitly, together with the ICSI "digit-task" sections (in which participants read sequences of digits to provide data for speech recognition experiments), and removed them from their data, as did we in Experiment 1 above. While this seems reasonable for the purposes of investigating ideal algorithm performance, in real situations we will be faced with such off-topic dialogue, and would obviously prefer segmentation performance not to be badly affected (ideally, we would even like to segment the off-topic sections from the meeting proper). One might suspect that an unsupervised generative model such as ours would not be robust in the presence of numerous off-topic words, as spurious topics might be inferred and used in the mixture model throughout. In order to investigate this, we therefore also tested on the full dataset without removing these sections (806,026 word tokens in total), and added the section boundaries as further desired gold-standard segmentation boundaries. Table 2 shows the results: performance is not significantly affected, and again is very similar for both our model and LCSeg.
4.4 Experiment 3: Speech recognition

The experiments so far have all used manual word transcriptions. Of course, in real meeting processing systems we will have to deal with speech recognition (ASR) errors. We therefore also tested on 1-best ASR output provided by ICSI, and results are shown in Table 2. The "off-topic" and "digits" sections were removed in this test, so results are comparable with Experiment 1. Segmentation accuracy seems extremely robust; interestingly, LCSeg's results are less robust (the drop in performance is higher), especially when the number of segments in a meeting is unknown.

Table 2: Results for Experiments 2 & 3: robustness to off-topic and ASR data.

                                        known          unknown
        Experiment           Model      Pk     WD      Pk     WD
        2 (off-topic data)   T = 10     .296   .342    .325   .366
                             LCSeg      .307   .338    .322   .386
        3 (ASR data)         T = 10     .266   .306    .291   .331
                             LCSeg      .289   .339    .378   .472

It is surprising to notice that the segmentation accuracy in this experiment was actually slightly higher than that achieved in Experiment 1 (especially given that ASR word error rates were generally above 20%). This may simply be a smoothing effect: differences in vocabulary and its distribution can effectively change the prior towards sparsity instantiated in the Dirichlet distributions.
We have presented an unsupervised generative model which allows topic segmentation and identification from unlabelled data. Performance on the ICSI corpus of multi-party meetings is comparable with the previous unsupervised segmentation results, and the extracted topics are rated well by human judges. Segmentation accuracy is robust in the face of noise, both in the form of off-topic discussion and speech recognition hypotheses.

Future Work. Spoken discourse exhibits several features not derived from the words themselves but which seem intuitively useful for segmentation, e.g. speaker changes, speaker identities and roles, silences, overlaps, prosody and so on. As shown by (Galley et al., 2003), some of these features can be combined with lexical information to improve segmentation performance (although in a supervised manner), and (Maskey and Hirschberg, 2003) show some success in broadcast news segmentation using only these kinds of non-lexical features. We are currently investigating the addition of non-lexical features as observed outputs in our unsupervised generative model.

We are also investigating improvements to the lexical model as presented here, firstly via simple techniques such as word stemming and replacement of named entities by generic class tokens (Barzilay and Lee, 2004); but also via the use of multiple ASR hypotheses by incorporating word confusion networks into our model. We expect that this will allow improved segmentation and identification performance with ASR data.
Acknowledgements

This work was supported by the CALO project (DARPA grant NBCH-D-03-0010). We thank Elizabeth Shriberg and Andreas Stolcke for providing automatic speech recognition data for the ICSI corpus and for their helpful advice; John Niekrasz and Alex Gruenstein for help with the NOMOS corpus annotation tool; and Michel Galley for discussion of his approach and results.