This paper proposes an improved approach to extractive summarization of spoken multi-party interaction, in which an integrated random walk is performed on a graph constructed from topical and lexical relations.
Each utterance is represented as a node of the graph, and the edges' weights are computed from the topical similarity between the utterances, evaluated using probabilistic latent semantic analysis (PLSA), and from word overlap.
We model intra-speaker topics by partially sharing the topics from the same speaker in the graph.
We perform experiments on automatically and manually generated transcripts.
For automatic transcripts, our results show that intra-speaker topic sharing and integrating topical and lexical relations can help include the important utterances.
Speech summarization is an active and important topic of research (Lee and Chen, 2005), because multimedia and spoken documents are more difficult to browse than text or image content.
While earlier work focused primarily on broadcast news content, recent effort has been increasingly directed to new domains such as lectures (Glass et al., 2007; Chen et al., 2011) and multi-party interaction (Banerjee and Rudnicky, 2008; Liu and Liu, 2010).
We describe experiments on multi-party interaction found in meeting recordings, performing extractive summarization (Liu et al., 2010) on transcripts generated by automatic speech recognition (ASR) and by human annotators.
Graph-based methods that compute lexical centrality as a measure of importance for extracting summaries (Erkan and Radev, 2004) have been investigated in the context of text summarization.
Other work focuses on maximizing the coverage of summaries through an objective function (Gillick, 2011).
Speech summarization carries intrinsic difficulties due to the presence of recognition errors, spontaneous speech effects, and the lack of segmentation.
A general approach has been found very successful (Furui et al., 2004), in which each utterance $U = t_1 t_2 \ldots t_i \ldots t_n$ in the document $d$, represented as a sequence of terms $t_i$, is given an importance score:

$$I(U, d) = \frac{1}{n} \sum_{i=1}^{n} \left[ \lambda_1 s(t_i, d) + \lambda_2 l(t_i) + \lambda_3 c(t_i) + \lambda_4 g(t_i) \right] + \lambda_5 b(U), \tag{1}$$

where $s(t_i, d)$, $l(t_i)$, $c(t_i)$ and $g(t_i)$ are respectively some statistical measure (such as TF-IDF), a linguistic measure (e.g., different part-of-speech tags are given different weights), a confidence score and an N-gram score for the term $t_i$; $b(U)$ is calculated from the grammatical structure of the utterance $U$; and $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ and $\lambda_5$ are weighting parameters.
For each document, the utterances to be used in the summary are then selected based on this score.
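As an illustration of how the score in (1) combines its five components, the following is a minimal Python sketch. The `Weights` class, the function name `importance`, and the callables passed in for $s$, $l$, $c$ and $g$ are hypothetical names introduced here; the paper does not prescribe an implementation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Weights:
    """Hypothetical container for the five weighting parameters of Eq. (1)."""
    stat: float = 1.0    # weight of the statistical measure s(t, d)
    ling: float = 0.0    # weight of the linguistic measure l(t)
    conf: float = 0.0    # weight of the confidence score c(t)
    ngram: float = 0.0   # weight of the N-gram score g(t)
    struct: float = 0.0  # weight of the structure score b(U)


def importance(terms: Sequence[str],
               s: Callable[[str], float],
               l: Callable[[str], float],
               c: Callable[[str], float],
               g: Callable[[str], float],
               b_u: float,
               w: Weights) -> float:
    """Average the weighted per-term scores, then add the utterance-level term."""
    if not terms:
        # Assumption: an empty utterance only receives the structure term.
        return w.struct * b_u
    per_term = sum(w.stat * s(t) + w.ling * l(t) + w.conf * c(t) + w.ngram * g(t)
                   for t in terms)
    return per_term / len(terms) + w.struct * b_u


if __name__ == "__main__":
    toy_s = {"budget": 0.2, "approved": 0.4}.get
    zero = lambda t: 0.0
    print(importance(["budget", "approved"], toy_s, zero, zero, zero,
                     b_u=0.5, w=Weights(stat=1.0, struct=2.0)))
```

Passing the component measures as callables keeps the sketch independent of how each measure is actually estimated.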
In recent work, Chen (2011) proposed a graphical structure to rescore $I(U, d)$, which can model the topical coherence between utterances using a random walk within documents.
Similarly, we use a graph-based approach to consider the importance of terms and the similarity between utterances, where topical and lexical similarity are integrated in the graph, so that utterances topically or lexically similar to more important utterances are given higher scores.
Using topical similarity can compensate, to some extent, for the negative effects of recognition errors on similarity evaluated from word overlap.
In addition, this paper proposes an approach to modeling intra-speaker topics in the graph to improve meeting summarization (Garg et al., 2009) using information from multi-party interaction, which is not available in lectures or broadcast news.
We apply word stemming and noise utterance filtering to the utterances in all meetings.
Then we construct a graph to compute the importance of all utterances.

[Figure 1: A simplified example of the graph considered. Nodes $U_1$ to $U_6$; the labelled topical edge weights are $p_t(4, 3)$ and $p_t(3, 4)$, with $A^t_4 = \{U_3, U_5, U_6\}$ and $B^t_4 = \{U_1, U_3, U_6\}$.]

We formulate the utterance selection problem as a random walk on a directed graph, in which each utterance is a node and the edges between nodes are weighted by topical and lexical similarity.
The basic idea is that an utterance similar to more important utterances should itself be more important (Chen et al., 2011).
We formulate two types of directed edges, topical edges and lexical edges, which are weighted by topical and lexical similarity respectively.
We then keep only the top $N$ outgoing edges with the highest weights from each node, while considering the incoming edges to each node for importance propagation in the graph.
A simplified example of such a graph with topical edges is shown in Figure 1, in which $A^t_i$ and $B^t_i$ are the sets of neighbors of the node $U_i$ connected respectively by outgoing and incoming topical edges.
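The edge pruning step can be sketched as follows. This is a minimal illustration, assuming the similarities are stored in a dense matrix `sim` with `sim[i][j]` the weight of the edge from utterance `i` to utterance `j`; the function name and the choices to skip self-loops and zero-weight edges are assumptions made here, not details given in the paper.

```python
from typing import Dict, List, Set, Tuple


def top_n_outgoing(sim: List[List[float]],
                   n: int) -> Tuple[Dict[int, Set[int]], Dict[int, Set[int]]]:
    """Keep the n highest-weighted outgoing edges of every node.

    Returns (out_nbrs, in_nbrs): out_nbrs[i] plays the role of the set A_i
    (targets of kept edges leaving node i) and in_nbrs[i] the role of B_i
    (sources of kept edges entering node i).
    """
    size = len(sim)
    out_nbrs: Dict[int, Set[int]] = {i: set() for i in range(size)}
    in_nbrs: Dict[int, Set[int]] = {i: set() for i in range(size)}
    for i in range(size):
        # Assumption: self-loops and zero-weight edges are never kept.
        candidates = [j for j in range(size) if j != i and sim[i][j] > 0]
        candidates.sort(key=lambda j: sim[i][j], reverse=True)
        for j in candidates[:n]:
            out_nbrs[i].add(j)
            in_nbrs[j].add(i)
    return out_nbrs, in_nbrs


if __name__ == "__main__":
    toy = [[0, 3, 1],
           [2, 0, 5],
           [4, 1, 0]]
    out_nbrs, in_nbrs = top_n_outgoing(toy, n=1)
    print(out_nbrs, in_nbrs)
```

Because pruning is applied to outgoing edges only, a node can end up with anywhere from zero to all other nodes as incoming neighbors.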
2.1 Parameters from PLSA.
Probabilistic latent semantic analysis (PLSA) (Hofmann, 1999) has been widely used to analyze the semantics of documents based on a set of latent topics.
Given a set of documents $\{d_j, j = 1, 2, \ldots, J\}$ and all terms $\{t_i, i = 1, 2, \ldots, M\}$ they include, PLSA uses a set of latent topic variables, $\{T_k, k = 1, 2, \ldots, K\}$, to characterize the "term-document" co-occurrence relationships.
The PLSA model can be optimized with the EM algorithm by maximizing a likelihood function.
We utilize two parameters from PLSA in this paper, latent topic significance (LTS) and latent topic entropy (LTE) (Kong and Lee, 2011).
Latent topic significance (LTS) for a given term $t_i$ with respect to a topic $T_k$ can be defined as

$$\mathrm{LTS}_{t_i}(T_k) = \frac{\sum_{d_j \in D} n(t_i, d_j)\, P(T_k \mid d_j)}{\sum_{d_j \in D} n(t_i, d_j)\, [1 - P(T_k \mid d_j)]}, \tag{2}$$

where $n(t_i, d_j)$ is the occurrence count of term $t_i$ in a document $d_j$.
Thus, a higher $\mathrm{LTS}_{t_i}(T_k)$ indicates that the term $t_i$ is more significant for the latent topic $T_k$.
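A small worked example may help in reading (2). The sketch below assumes the per-document counts of one term and the PLSA posteriors $P(T_k \mid d_j)$ are already available as plain lists; the function name and the handling of a zero denominator are assumptions made for illustration.

```python
from typing import List


def latent_topic_significance(counts: List[float],
                              p_topic_given_doc: List[List[float]],
                              k: int) -> float:
    """LTS of one term for topic k, following Eq. (2).

    counts[j]               = n(t_i, d_j), occurrences of the term in document j
    p_topic_given_doc[j][k] = P(T_k | d_j), taken from a trained PLSA model
    """
    num = sum(n * p[k] for n, p in zip(counts, p_topic_given_doc))
    den = sum(n * (1.0 - p[k]) for n, p in zip(counts, p_topic_given_doc))
    # Assumption: a term seen only in documents fully assigned to topic k
    # gets an infinite score instead of raising ZeroDivisionError.
    return num / den if den > 0 else float("inf")


if __name__ == "__main__":
    counts = [2, 1]                      # term occurs twice in d_1, once in d_2
    posteriors = [[0.75, 0.25],          # P(T_1 | d_1), P(T_2 | d_1)
                  [0.50, 0.50]]          # P(T_1 | d_2), P(T_2 | d_2)
    print(latent_topic_significance(counts, posteriors, k=0))
    print(latent_topic_significance(counts, posteriors, k=1))
```

In this toy case the term is concentrated in the document dominated by the first topic, so its LTS for that topic is the larger of the two values.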
Latent topic entropy (LTE) for a given term $t_i$ can be calculated from the topic distribution $P(T_k \mid t_i)$:

$$\mathrm{LTE}(t_i) = -\sum_{k=1}^{K} P(T_k \mid t_i) \log P(T_k \mid t_i), \tag{3}$$

where the topic distribution $P(T_k \mid t_i)$ can be estimated from PLSA.
$\mathrm{LTE}(t_i)$ measures how strongly the term $t_i$ is focused on a few topics, so a lower latent topic entropy implies that the term carries more topical information.
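The entropy in (3) can be computed directly from a term's topic distribution. The following minimal sketch uses the natural logarithm and treats $0 \log 0$ as $0$; both choices, and the function name, are assumptions, since the paper does not state the log base.

```python
import math
from typing import Sequence


def latent_topic_entropy(p_topic_given_term: Sequence[float]) -> float:
    """Entropy of P(T_k | t_i) over topics, following Eq. (3)."""
    # Zero-probability topics contribute nothing (0 * log 0 is taken as 0).
    return -sum(p * math.log(p) for p in p_topic_given_term if p > 0)


if __name__ == "__main__":
    focused = [0.97, 0.01, 0.01, 0.01]   # term tied to one topic
    spread = [0.25, 0.25, 0.25, 0.25]    # term spread evenly over topics
    print(latent_topic_entropy(focused), latent_topic_entropy(spread))
```

A function word spread evenly over $K$ topics reaches the maximum value $\log K$, while a term tied to a single topic has entropy close to zero.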
2.2 Statistical Measures of a Term.
The statistical measure of a term $t_i$, $s(t_i, d)$ in (1), can be defined in terms of $\mathrm{LTE}(t_i)$ in (3) as

$$s(t_i, d) = \frac{\gamma \cdot n(t_i, d)}{\mathrm{LTE}(t_i)}, \tag{4}$$

where $\gamma$ is a scaling factor such that $0 \le s(t_i, d) \le 1$; the score $s(t_i, d)$ is inversely proportional to the latent topic entropy $\mathrm{LTE}(t_i)$.
Kong and Lee (2011) showed that using $s(t_i, d)$ as defined in (4) in (1) outperformed the very successful "significance score" (Furui et al., 2004) in speech summarization; we therefore use it as the baseline.
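One way to realize the scaling in (4) is sketched below. The paper only requires that the scaling factor keeps the scores in $[0, 1]$; choosing it as the reciprocal of the largest unscaled score in the document, and flooring the entropy with a small `eps` to avoid division by zero, are assumptions made for this illustration.

```python
from typing import Dict


def statistical_scores(counts: Dict[str, float],
                       lte: Dict[str, float],
                       eps: float = 1e-6) -> Dict[str, float]:
    """Scaled count-over-entropy score for every term of one document.

    counts[t] = n(t, d), occurrences of term t in the document
    lte[t]    = LTE(t), latent topic entropy of term t
    """
    if not counts:
        return {}
    # Assumption: entropies below eps are floored to keep the ratio finite.
    raw = {t: n / max(lte[t], eps) for t, n in counts.items()}
    # Assumption: the scaling factor maps the largest raw score to exactly 1.
    scale = 1.0 / max(raw.values())
    return {t: scale * value for t, value in raw.items()}


if __name__ == "__main__":
    print(statistical_scores({"budget": 2, "the": 4},
                             {"budget": 0.5, "the": 2.0}))
```

In the toy call, the topical word outranks the more frequent function word because its entropy is four times lower.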
2.3 Similarity between Utterances.
Within a document $d$, we can first compute the probability that the topic $T_k$ is addressed by an utterance $U_i$:

$$P(T_k \mid U_i) = \frac{\sum_{t \in U_i} n(t, U_i)\, P(T_k \mid t)}{\sum_{t \in U_i} n(t, U_i)}. \tag{5}$$

Then an asymmetric topical similarity $\mathrm{TopicSim}(U_i, U_j)$ for utterances $U_i$ to $U_j$ (with direction $U_i \rightarrow U_j$) can be defined by accumulating $\mathrm{LTS}_t(T_k)$ in (2) weighted by $P(T_k \mid U_i)$ for all terms $t$ in $U_j$ over all latent topics:

$$\mathrm{TopicSim}(U_i, U_j) = \sum_{t \in U_j} \sum_{k=1}^{K} \mathrm{LTS}_t(T_k)\, P(T_k \mid U_i), \tag{6}$$

where the idea is very similar to the generative probability in IR.
We call it the generative significance of $U_i$ given $U_j$.
Within a document $d$, the lexical similarity is the measure of word overlap between the utterances $U_i$ and $U_j$.
We compute $\mathrm{LexSim}(U_i, U_j)$ as the cosine similarity between the two TF-IDF vectors of $U_i$ and $U_j$, as in the well-known LexRank (Erkan and Radev, 2004).
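The two similarity measures can be sketched side by side to make the asymmetry of the topical one visible. In this minimal illustration, utterances are token lists, and `p_topic_given_term`, `lts` and `idf` are assumed to be precomputed lookup tables; all names are hypothetical. The sum over $t \in U_j$ in (6) is taken here over term occurrences, which is one possible reading of the notation.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def topic_posterior(utt: Sequence[str],
                    p_topic_given_term: Dict[str, List[float]],
                    num_topics: int) -> List[float]:
    """Eq. (5): count-weighted average of P(T_k | t) over the utterance terms."""
    counts = Counter(utt)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * num_topics
    return [sum(n * p_topic_given_term[t][k] for t, n in counts.items()) / total
            for k in range(num_topics)]


def topic_sim(utt_i: Sequence[str], utt_j: Sequence[str],
              p_topic_given_term: Dict[str, List[float]],
              lts: Dict[str, List[float]], num_topics: int) -> float:
    """Eq. (6): LTS of the terms in U_j weighted by the topic posterior of U_i."""
    post = topic_posterior(utt_i, p_topic_given_term, num_topics)
    # Assumption: every occurrence of a term in U_j contributes once.
    return sum(lts[t][k] * post[k] for t in utt_j for k in range(num_topics))


def lex_sim(utt_i: Sequence[str], utt_j: Sequence[str],
            idf: Dict[str, float]) -> float:
    """Cosine similarity between the TF-IDF vectors of two utterances."""
    vec_i = {t: n * idf.get(t, 0.0) for t, n in Counter(utt_i).items()}
    vec_j = {t: n * idf.get(t, 0.0) for t, n in Counter(utt_j).items()}
    dot = sum(w * vec_j.get(t, 0.0) for t, w in vec_i.items())
    norm = (math.sqrt(sum(w * w for w in vec_i.values()))
            * math.sqrt(sum(w * w for w in vec_j.values())))
    return dot / norm if norm > 0 else 0.0


if __name__ == "__main__":
    p_tk_t = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
    lts = {"a": [2.0, 0.5], "b": [0.5, 2.0]}
    u1, u2 = ["a", "a", "a", "b"], ["a"]
    print(topic_sim(u1, u2, p_tk_t, lts, 2), topic_sim(u2, u1, p_tk_t, lts, 2))
    print(lex_sim(u1, u2, {"a": 1.0, "b": 2.0}))
```

Swapping the arguments of `topic_sim` changes the result, whereas `lex_sim` returns the same value in both directions.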
Note that $\mathrm{LexSim}(U_i, U_j) = \mathrm{LexSim}(U_j, U_i)$.
2.4 Intra-Speaker Topic Modeling.
We assume a single speaker usually focuses on similar topics, so if an utterance is important, the scores of the other utterances from the same speaker should be increased.
We therefore increase the similarity between utterances from the same speaker so that they share topics:

$$\mathrm{TopicSim}'(U_i, U_j) = \begin{cases} \mathrm{TopicSim}(U_i, U_j)^{1+w}, & \text{if } U_i \in S_k \text{ and } U_j \in S_k \\ \mathrm{TopicSim}(U_i, U_j)^{1-w}, & \text{otherwise} \end{cases} \tag{7}$$

where $S_k$ is the set including all utterances from speaker $k$, and $w$ is a weighting parameter for modeling the speaker relation, which represents the level of topical coherence within a single speaker.
In this way, the topics from the same speaker can be partially shared.
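The speaker-dependent adjustment in (7) amounts to a one-line rule. The sketch below reads the $1+w$ and $1-w$ terms as exponents, which is an interpretation of the typeset equation; the function and argument names are hypothetical.

```python
def intra_speaker_topic_sim(sim: float, speaker_i: str, speaker_j: str,
                            w: float) -> float:
    """Adjust a topical similarity according to whether the speakers match.

    Assumption: 1 + w and 1 - w act as exponents on the similarity.
    """
    exponent = 1.0 + w if speaker_i == speaker_j else 1.0 - w
    return sim ** exponent


if __name__ == "__main__":
    print(intra_speaker_topic_sim(4.0, "spk1", "spk1", w=0.5))
    print(intra_speaker_topic_sim(4.0, "spk1", "spk2", w=0.5))
```

Under this reading, a same-speaker pair is boosted relative to a cross-speaker pair whenever the similarity exceeds 1; for similarities below 1 the exponent acts in the opposite direction, so the scale of TopicSim matters when choosing $w$.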
2.5 Integrated Random Walk.
We modify the random walk (Hsu and Kennedy, 2007; Chen et al., 2011) to integrate the two types of similarity over the graph obtained above.
$v(i)$ is the new score for node $U_i$, which is the interpolation of three scores: the normalized initial importance $r(i)$ for node $U_i$ and the scores contributed by all neighboring nodes $U_j$ of node $U_i$, weighted by $p_t(j, i)$ and $p_l(j, i)$,

$$v(i) = (1 - \alpha - \beta)\, r(i) + \alpha \sum_{U_j \in B^t_i} p_t(j, i)\, v(j) + \beta \sum_{U_j \in B^l_i} p_l(j, i)\, v(j), \tag{8}$$

where $\alpha$ and $\beta$ are the interpolation weights, $B^t_i$ is the set of neighbors connected to node $U_i$ via topical incoming edges, $B^l_i$ is the set of neighbors connected to node $U_i$ via lexical incoming edges, and

$$r(i) = \frac{I(U_i, d)}{\sum_{U_j} I(U_j, d)} \tag{9}$$

is the normalized importance score of utterance $U_i$, with $I(U_i, d)$ as in (1).
We normalize the topical similarity by the total similarity summed over the set of outgoing edges, to produce the weight $p_t(j, i)$ for the edge from $U_j$ to $U_i$ on the graph.
Similarly, $p_l(j, i)$ is normalized over the lexical edges.
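The edge normalization can be illustrated with a short sketch. It assumes a dense similarity matrix `sim` with `sim[j][i]` the similarity on the edge from utterance `j` to utterance `i`, and a dictionary `out_nbrs` giving the kept outgoing neighbors of each node; both names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def normalize_outgoing(sim: List[List[float]],
                       out_nbrs: Dict[int, Set[int]]) -> Dict[Tuple[int, int], float]:
    """Return p with p[(j, i)] the normalized weight of the edge U_j -> U_i.

    Each weight is the raw similarity divided by the total similarity over
    the outgoing edges kept for the source node j, so the weights leaving
    any node sum to 1.
    """
    p: Dict[Tuple[int, int], float] = {}
    for j, nbrs in out_nbrs.items():
        total = sum(sim[j][i] for i in nbrs)
        for i in nbrs:
            # Assumption: a node whose kept edges all have zero weight
            # propagates nothing.
            p[(j, i)] = sim[j][i] / total if total > 0 else 0.0
    return p


if __name__ == "__main__":
    toy = [[0, 3, 1],
           [2, 0, 6],
           [4, 4, 0]]
    print(normalize_outgoing(toy, {0: {1, 2}, 1: {2}, 2: {0, 1}}))
```

The same function can be applied once to the topical similarities and once to the lexical similarities to obtain the two sets of edge weights.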
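Equation (8) can be solved iteratively with an approach very similar to that used for the PageRank problem (Page et al., 1998).
Let $\mathbf{v} = [v(i), i = 1, 2, \ldots, L]^T$ and $\mathbf{r} = [r(i), i = 1, 2, \ldots, L]^T$ be the column vectors of $v(i)$ and $r(i)$ for all utterances in the document, where $L$ is the total number of utterances in the document $d$ and $T$ denotes transpose.
Equation (8) then has the vector form

$$\begin{aligned} \mathbf{v} &= (1 - \alpha - \beta)\, \mathbf{r} + \alpha \mathbf{P}_t \mathbf{v} + \beta \mathbf{P}_l \mathbf{v} \\ &= \left( (1 - \alpha - \beta)\, \mathbf{r} \mathbf{e}^T + \alpha \mathbf{P}_t + \beta \mathbf{P}_l \right) \mathbf{v} = \mathbf{P}' \mathbf{v}, \end{aligned} \tag{10}$$

where $\mathbf{P}_t$ and $\mathbf{P}_l$ are $L \times L$ matrices of $p_t(j, i)$ and $p_l(j, i)$ respectively, and $\mathbf{e} = [1, 1, \ldots, 1]^T$; the second equality assumes the scores are normalized so that $\mathbf{e}^T \mathbf{v} = 1$.
It has been shown that the solution $\mathbf{v}$ of (10) is the dominant eigenvector of $\mathbf{P}'$ (Langville and Meyer, 2006), that is, the eigenvector corresponding to the largest absolute eigenvalue of $\mathbf{P}'$.
The solution $v(i)$ can then be obtained.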
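A fixed-point iteration of (8), in the style of the PageRank power method, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: edge weights are held in dictionaries keyed by `(j, i)` for the edge from utterance `j` to utterance `i`, the iteration starts from the initial scores `r`, and the stopping rule (`tol`, `max_iters`) is a choice made here.

```python
from typing import Dict, List, Sequence, Tuple

Edges = Dict[Tuple[int, int], float]


def integrated_random_walk(r: Sequence[float], p_topic: Edges, p_lex: Edges,
                           alpha: float, beta: float,
                           max_iters: int = 1000, tol: float = 1e-10) -> List[float]:
    """Iterate the score update of Eq. (8) until the scores stop changing.

    r        normalized initial importance scores (summing to 1)
    p_topic  normalized topical edge weights, p_topic[(j, i)] for U_j -> U_i
    p_lex    normalized lexical edge weights, same layout
    """
    v = list(r)
    for _ in range(max_iters):
        # Damped contribution of the initial importance.
        new = [(1.0 - alpha - beta) * r_i for r_i in r]
        # Importance propagated along incoming topical and lexical edges.
        for (j, i), weight in p_topic.items():
            new[i] += alpha * weight * v[j]
        for (j, i), weight in p_lex.items():
            new[i] += beta * weight * v[j]
        delta = sum(abs(a - b) for a, b in zip(new, v))
        v = new
        if delta < tol:
            break
    return v


if __name__ == "__main__":
    r = [1 / 3, 1 / 3, 1 / 3]
    topical = {(0, 2): 1.0, (1, 2): 1.0, (2, 0): 1.0}
    print(integrated_random_walk(r, topical, {}, alpha=0.9, beta=0.0))
```

In the toy graph, the node with two incoming edges ends up with the highest score and the node with none keeps only its damped initial score. If some node has no outgoing edges of a given type, the scores no longer sum to one, so a practical implementation may want to renormalize.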
3.1 Corpus.
The corpus used in this research consists of a sequence of naturally occurring meetings, which featured largely overlapping participant sets and topics of discussion.
For each meeting, SmartNotes (Banerjee and Rudnicky, 2008) was used to record both the audio from each participant and his or her notes.
The meetings were transcribed both manually and using a speech recognizer; the word error rate is around 44%.
In this paper we use 10 meetings held from April to June of 2006.
On average each meeting had about 28 minutes of speech.
Across these 10 meetings there were 6 unique participants; each meeting featured between 2 and 4 of these participants (average: 3.7).
The total number of utterances is 9837 across 10 meetings.
We split the data into a dev set (2 meetings) and a test set (8 meetings).
The dev set is used to tune parameters such as $\alpha$, $\beta$ and $w$.
The reference summaries are given by the set of "noteworthy utterances": two annotators manually labelled the degree (three levels) of "noteworthiness" for each utterance, and we extract the utterances with the top level of "noteworthiness" to form the summary of each meeting.
In the following experiments, for each meeting, we extract the top-scoring utterances up to 30% of the number of terms as the summary.
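Selecting a summary under a term budget can be sketched as a greedy procedure. The paper does not spell out the selection loop, so the details below are assumptions: utterances are taken in decreasing score order, the budget is `ratio` times the total number of terms in the meeting, and selection stops at the first utterance that would exceed the budget.

```python
from typing import List, Sequence


def select_summary(utterances: Sequence[Sequence[str]],
                   scores: Sequence[float],
                   ratio: float = 0.3) -> List[int]:
    """Return the indices of the selected utterances in their original order."""
    budget = ratio * sum(len(u) for u in utterances)
    ranked = sorted(range(len(utterances)), key=lambda i: scores[i], reverse=True)
    chosen: List[int] = []
    used = 0
    for i in ranked:
        # Assumption: stop (rather than skip) once the next utterance
        # would push the summary past the term budget.
        if used + len(utterances[i]) > budget:
            break
        chosen.append(i)
        used += len(utterances[i])
    return sorted(chosen)


if __name__ == "__main__":
    meeting = [["we", "should", "start", "now"],
               ["the", "budget", "is", "approved"],
               ["okay"],
               ["deadline", "is", "friday"]]
    print(select_summary(meeting, scores=[0.1, 0.9, 0.05, 0.8], ratio=0.6))
```

Returning the indices in document order keeps the extracted summary readable as a condensed transcript.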
3.2 Evaluation Metrics.
Automated evaluation utilizes the standard DUC evaluation metric ROUGE (Lin, 2004), which represents recall over various n-gram statistics from a system-generated summary against a set of human-generated peer summaries.
F-measures for ROUGE-1 (unigram) and ROUGE-L (longest common subsequence) can be evaluated in exactly the same way, and are used in the following results.
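To make the two metrics concrete, the following is a simplified single-reference sketch of the ROUGE-1 and ROUGE-L F-measures over token lists. It is not the official ROUGE toolkit, which additionally supports multiple references, stemming and sentence-level LCS handling; the function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def f_measure(overlap: int, cand_len: int, ref_len: int) -> float:
    """Balanced F-measure from an overlap count and the two lengths."""
    if overlap == 0:
        return 0.0
    precision = overlap / cand_len
    recall = overlap / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge_1_f(cand: Sequence[str], ref: Sequence[str]) -> float:
    """Unigram overlap, with counts clipped by the reference counts."""
    overlap = sum((Counter(cand) & Counter(ref)).values())
    return f_measure(overlap, len(cand), len(ref))


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence (row-by-row dynamic program)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for k, y in enumerate(b, 1):
            cur.append(prev[k - 1] + 1 if x == y else max(prev[k], cur[k - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f(cand: Sequence[str], ref: Sequence[str]) -> float:
    """F-measure based on the longest common subsequence."""
    return f_measure(lcs_length(cand, ref), len(cand), len(ref))


if __name__ == "__main__":
    cand = "approved budget".split()
    ref = "budget approved".split()
    print(rouge_1_f(cand, ref), rouge_l_f(cand, ref))
```

The toy pair shares both unigrams but in reversed order, so the unigram score is perfect while the subsequence-based score is not.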
3.3 Results.
Table 1 shows the performance achieved by all proposed approaches.
In these experiments, the damping factor, $(1 - \alpha - \beta)$ in (8), is empirically set to 0.1.
Row (a) is the baseline, which uses the LTE-based statistical measure to compute the importance of utterances $I(U, d)$.
Row (b) is the result considering only lexical similarity; row (c) uses only topical similarity.
Row (d) is the result of additionally including speaker information through $\mathrm{TopicSim}'(U_i, U_j)$.
Row (e) is the result of the integrated random walk (with $\alpha \neq 0$ and $\beta \neq 0$) using parameters that have been optimized on the dev set.
3.3.1 Graph-Based Approach.
We can see that the performance after graph-based recomputation, shown in rows (b) and (c), is significantly better than the baseline, shown in row (a), for both ASR and manual transcripts.
For ASR transcripts, topical similarity and lexical similarity give similar results.
For manual transcripts, topical similarity performs slightly worse than lexical similarity, because manual transcripts do not contain recognition errors, and therefore word overlap can accurately measure the similarity between two utterances.
However, for ASR transcripts, although topical similarity is not as accurate as lexical similarity, it can compensate for recognition errors, so that the approaches have similar performance.
Thus, graph-based approaches can significantly improve the baseline results.

| F-measure | ASR ROUGE-1 | ASR ROUGE-L | Manual ROUGE-1 | Manual ROUGE-L |
|---|---|---|---|---|
| (a) Baseline: LTE | 46.816 | 46.256 | 44.987 | 44.162 |
| (b) LexSim ($\alpha = 0$, $\beta = 0.9$) | 48.940 | 48.504 | 46.540 | 45.858 |
| (c) TopicSim ($\alpha = 0.9$, $\beta = 0$) | 49.058 | 48.436 | 46.199 | 45.392 |
| (d) Intra-Speaker TopicSim | 49.212 | 48.351 | 47.104 | 46.299 |
| (e) Integrated Random Walk | 49.792 | 49.156 | 46.714 | 46.064 |
| MAX RI | +6.357 | +6.269 | +4.706 | +4.839 |

Table 1: Maximum relative improvement (RI) with respect to the baseline for all proposed approaches (%).

[Figure 2: The performance from integrated random walk with different combination weights, $\alpha$ and $\beta$ ($\alpha + \beta = 0.9$ in all cases), for ASR transcripts. Horizontal axis: $\alpha$ from 0.9 to 0.0 and $\beta$ from 0.0 to 0.9; vertical axis: F-measure from 48 to 50; curves: ROUGE-1 and ROUGE-L.]
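The MAX RI row of Table 1 follows from simple arithmetic: the best score in a column minus the baseline, divided by the baseline. The short sketch below recomputes it from the reported F-measures; the function name is hypothetical.

```python
from typing import Sequence


def relative_improvement(baseline: float, scores: Sequence[float]) -> float:
    """Maximum relative improvement over the baseline, in percent."""
    return (max(scores) - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Rows (b) to (e) of Table 1, one list per column.
    columns = {
        "ASR ROUGE-1": (46.816, [48.940, 49.058, 49.212, 49.792]),
        "ASR ROUGE-L": (46.256, [48.504, 48.436, 48.351, 49.156]),
        "Manual ROUGE-1": (44.987, [46.540, 46.199, 47.104, 46.714]),
        "Manual ROUGE-L": (44.162, [45.858, 45.392, 46.299, 46.064]),
    }
    for name, (baseline, scores) in columns.items():
        print(name, round(relative_improvement(baseline, scores), 3))
```

For the ASR columns the maximum comes from row (e), while for the manual columns it comes from row (d).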
3.3.2 Effectiveness of Intra-Speaker Modeling.
We find that modeling intra-speaker topics can improve the performance (compare rows (c) and (d)), which means speaker information is useful for modeling topical similarity.
The experiment shows that intra-speaker modeling can help us include the important utterances for both ASR and manual transcripts.
3.3.3 Integration of Topical and Lexical Similarity.
Row (e) shows the result of the proposed approach, which integrates topical and lexical similarity into a single graph, considering the two types of relations together.
For ASR transcripts, row (e) is better than row (b) and row (d), which means that, because of recognition errors, topical similarity and lexical similarity model different types of relations.
Figure 2 shows the sensitivity of the integrated random walk to the combination weights.
We can see that topical similarity and lexical similarity are additive, i.e. they can compensate for each other, improving the performance when the two types of edges are integrated in a single graph.
Note that the exact values of $\alpha$ and $\beta$ do not matter much for the performance.
For manual transcripts, row (e) does not perform better by combining the two types of similarity, which means topical similarity can dominate lexical similarity, since without recognition errors topical similarity can model the relations accurately, and additionally modeling intra-speaker topics can effectively improve the performance.
In addition, Banerjee and Rudnicky (2008) used supervised learning to detect noteworthy utterances on the same corpus, and achieved ROUGE-1 scores of around 43% for ASR and 47% for manual transcriptions.
Our unsupervised approach performs better, especially for ASR transcripts.
Note that the performance on ASR is better than on manual transcripts.
Because a higher percentage of recognition errors occurs on "unimportant" words, these words tend to receive lower scores; we can then exclude the utterances with more errors, and achieve better summarization results.
Other recent work has also demonstrated better performance for ASR than for manual transcripts (Chen et al., 2011; Kong and Lee, 2011).
Extensive experiments and evaluation with ROUGE metrics showed that intra-speaker topics can be modeled in topical similarity and that the integrated random walk can combine the advantages of the two types of edges for imperfect ASR transcripts, where we achieved more than 6% relative improvement.
We plan to model inter-speaker topics in the graph-based approach in the future.
Acknowledgements
The first author was supported by the Institute of Education Science, U.S. Department of Education, through Grants R305A080628 to Carnegie Mellon University.
Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views or official policies, either expressed or implied, of the Institute or the U.S. Department of Education.