~~We address the problem of learning the mapping between words and their possible pro nunciations in terms of sub-word units.~~
~~Mostprevious approaches have involved generative modeling of the distribution of pronuncia tions, usually trained to maximize likelihood.We propose a discriminative, feature-rich approach using large-margin learning.~~
~~This ap proach allows us to optimize an objective closely related to a discriminative task, toincorporate a large number of complex fea tures, and still do inference efficiently.~~
~~We test the approach on the task of lexical access;that is, the prediction of a word given a phonetic transcription.~~
~~In experiments on a sub set of the Switchboard conversational speechcorpus, our models thus far improve classification error rates from a previously pub lished result of 29.1% to about 15%.~~
~~We find that large-margin approaches outperform conditional random field learning, and thatthe Passive-Aggressive algorithm for large margin learning is faster to converge than the Pegasos algorithm.~~
~~One of the problems faced by automatic speech recognition, especially of conversational speech, is that of modeling the mapping between words and their possible pronunciations in terms of sub-wordunits such as phones.~~
~~While pronouncing dictionar ies provide each word?s canonical pronunciation(s)in terms of phoneme strings, running speech of ten includes pronunciations that differ greatly from the dictionary.~~
~~For example, some pronunciations of ?probably?~~
~~in the Switchboard conversational speech database are [p r aa b iy], [p r aa l iy], [p r ay], and [p ow ih] (Greenberg et al, 1996).~~
~~While some words (e.g., common words) are more prone to such variation than others, the effect is extremely general: In the phonetically transcribed portion of Switchboard, fewer than half of the word tokens are pronounced canonically (Fosler-Lussier, 1999).In addition, pronunciation variants sometimes in clude sounds not present in the dictionary at all, such as nasalized vowels (?can?t? ? [k ae n n t]) or fricatives introduced due to incomplete consonantclosures (?legal?~~
~~[l iy g fr ix l]).1 This varia tion makes pronunciation modeling one of the major challenges facing speech recognition (McAllaster etal., 1998; Jurafsky et al, 2001; Sarac?lar and Khu danpur, 2004; Bourlard et al, 1999).~~
~~2 Most efforts to address the problem have involved either learning alternative pronunciations and/or their probabilities (Holter and Svendsen, 1999) orusing phonetic transformation (substitution, insertion, and deletion) rules, which can come from lin guistic knowledge or be learned from data (Riley et al, 1999; Hazen et al, 2005; Hutchinson andDroppo, 2011).~~
~~These have produced some im provements in recognition performance.~~
~~However, they also tend to cause additional confusability dueto the introduction of additional homonyms (Fosler1We use the ARPAbet phonetic alphabet with additional di acritics, such as [ n] for nasalization and [ fr] for frication.~~
~~2This problem is separate from the grapheme-to-phoneme problem, in which pronunciations are predicted from a word?s spelling; here, we assume the availability of a dictionary of canonical pronunciations as is usual in speech recognition.~~
~~194 Lussier et al, 2002).~~
~~Some other alternatives are articulatory pronunciation models, in which wordsare represented as multiple parallel sequences of ar ticulatory features rather than single sequences of phones, and which outperform phone-based models on some tasks (Livescu and Glass, 2004; Jyothi etal., 2011); and models for learning edit distances be tween dictionary and actual pronunciations (Ristad and Yianilos, 1998; Filali and Bilmes, 2005).~~
~~All of these approaches are generative?i.e., they provide distributions over possible pronunciations given the canonical one(s)?and they are typicallytrained by maximizing the likelihood over training data.~~
~~In some recent work, discriminative ap proaches have been proposed, in which an objective more closely related to the task at hand is optimized.~~
~~For example, (Vinyals et al, 2009; Korkmazskiy and Juang, 1997) optimize a minimum classificationerror (MCE) criterion to learn the weights (equiv alently, probabilities) of alternative pronunciations for each word; (Schramm and Beyerlein, 2001) usea similar approach with discriminative model com bination.~~
~~In this work, the weighted alternatives arethen used in a standard (generative) speech recog nizer.~~
~~In other words, these approaches optimize generative models using discriminative criteria.We propose a general, flexible discriminative approach to pronunciation modeling, rather than dis criminatively optimizing a generative model.~~
~~We formulate a linear model with a large number of word-level and subword-level feature functions,whose weights are learned by optimizing a discriminative criterion.~~
~~The approach is related to the re cently proposed segmental conditional random field (SCRF) approach to speech recognition (Zweig etal., 2011).~~
~~The main differences are that we opti mize large-margin objective functions, which lead to sparser, faster, and better-performing models thanconditional random field optimization in our exper iments; and we use a large set of different feature functions tailored to pronunciation modeling.~~
~~In order to focus attention on the pronunciation model alone, our experiments focus on a task thatmeasures only the mapping between words and sub word units.~~
~~Pronunciation models have in the pastbeen tested using a variety of measures.~~
~~For generative models, phonetic error rate of generated pro nunciations (Venkataramani and Byrne, 2001) and phone- or frame-level perplexity (Riley et al, 1999; Jyothi et al, 2011) are appropriate measures.~~
~~For our discriminative models, we consider the task of lexical access; that is, prediction of a single word given its pronunciation in terms of sub-word units (Fissore et al, 1989; Jyothi et al, 2011).~~
~~This task is also sometimes referred to as ?pronunciationrecognition?~~
~~(Ristad and Yianilos, 1998) or ?pro nunciation classification?~~
~~(Filali and Bilmes, 2005).)~~
~~As we show below, our approach outperforms both traditional phonetic rule-based models and the best previously published results on our data set obtained with generative articulatory approaches.~~
~~We define a pronunciation of a word as a representa tion of the way it is produced by a speaker in terms of some set of linguistically meaningful sub-wordunits.~~
~~A pronunciation can be, for example, a sequence of phones or multiple sequences of articu latory features such as nasality, voicing, and tongue and lip positions.~~
~~For purposes of this paper, we will assume that a pronunciation is a single sequence ofunits, but the approach applies to other representations.~~
~~We distinguish between two types of pronun ciations of a word: (i) canonical pronunciations, theones typically found in the dictionary, and (ii) surface pronunciations, the ways a speaker may actu ally produce the word.~~
~~In the task of lexical access we are given a surface pronunciation of a word, and our goal is to predict the word.~~
~~Formally, we define a pronunciation as a sequence of sub-word units p = (p1, p2, . . .~~
~~, pK), where pk ? P for all 1 ? k ? K and P is the set of all sub-word units.~~
~~The index k can represent either a fixed-length frame or a variable-length segment.~~
~~P? denotes the set of all finite-length sequences over P . We denote a word by w ? V where V is the vocabulary.~~
~~Our goal is to find a function f : P?~~
~~V that takes as input a surface pronunciation and returns the word from the vocabulary that was spoken.In this paper we propose a discriminative super vised learning approach for learning the function f from a training set of pairs (p, w).~~
~~We aim to find a function f that performs well on the training set as well as on unseen examples.~~
~~Let w?~~
~~= f(p) be the predicted word given the pronunciation p. We assess the quality of the function f by the zero-one loss: if 195 w 6= w?~~
~~then the error is one, otherwise the error iszero.~~
~~The goal of the learning process is to minimize the expected zero-one loss, where the expec tation is taken with respect to a fixed but unknown distribution over words and surface pronunciations.~~
~~In the next section we present a learning algorithm that aims to minimize the expected zero-one loss.~~
~~Similarly to previous work in structured prediction (Taskar et al, 2003; Tsochantaridis et al, 2005), we construct the function f from a predefined set of N feature functions, {?j}Nj=1, each of the form?j : P??V ? R. Each feature function takes a surface pronunciation p and a proposed word w and re turns a scalar which, intuitively, should be correlated with whether the pronunciation p corresponds to the word w. The feature functions map pronunciations of different lengths along with a proposed word to a vector of fixed dimension in RN . For example, onefeature function might measure the Levenshtein dis tance between the pronunciation p and the canonical pronunciation of the word w. This feature functioncounts the minimum number of edit operations (in sertions, deletions, and substitutions) that are needed to convert the surface pronunciation to the canonical pronunciation; it is low if the surface pronunciation is close to the canonical one and high otherwise.~~
~~The function f maximizes a score relating theword w to the pronunciation p. We restrict ourselves to scores that are linear in the feature func tions, where each ?j is scaled by a weight ?j : N?~~
~~j=1 ?j?j(p, w) = ? ?~~
~~?(p, w), where we have used vector notation for the feature functions ? = (?1, . . .~~
~~, ?N ) and for the weights ? = (?1, . . .~~
~~, ?N ).~~
~~Linearity is not a very strongrestriction, since the feature functions can be arbi trarily non-linear.~~
~~The function f is defined as the word w that maximizes the score, f(p) = argmax w?V ? ?~~
~~?(p, w).~~
~~Our goal in learning ? is to minimize the expected zero-one loss: ??~~
~~= argmin ? E(p,w)??~~
~~[ 1w 6=f(p) ] ,where 1pi is 1 if predicate pi holds and 0 other wise, and where ? is an (unknown) distribution from which the examples in our training set are sampled i.i.d. Let S = {(p1, w1), . . .~~
~~, (pm, wm)} be the training set.~~
~~Instead of working directly with the zero-one loss, which is non-smooth and non-convex, we use the surrogate hinge loss, which upper-bounds the zero-one loss: L(?, pi, wi) = max w?V [ 1wi 6=w ? ?~~
~~?(pi, wi) + ? ?~~
~~?(pi, w) ] .~~
~~(1) Finding the weight vector ? that minimizes the `2-regularized average of this loss function is the structured support vector machine (SVM) problem (Taskar et al, 2003; Tsochantaridis et al, 2005): ??~~
~~= argmin ? ?~~
~~2 ???2 + 1 m m?~~
~~i=1 L(?, pi, wi), (2)where ? is a user-defined tuning parameter that bal ances between regularization and loss minimization.~~
~~In practice, we have found that solving thequadratic optimization problem given in Eq.~~
~~(2) con verges very slowly using standard methods such as stochastic gradient descent (Shalev-Shwartz et al, 2007).~~
~~We use a slightly different algorithm, the Passive-Aggressive (PA) algorithm (Crammer et al, 2006), whose average loss is comparable to that of the structured SVM solution (Keshet et al, 2007).~~
~~The Passive-Aggressive algorithm is an efficient online algorithm that, under some conditions, can be viewed as a dual-coordinate ascent minimizer of Eq.~~
~~(2) (The connection to dual-coordinate ascent can be found in (Hsieh et al, 2008)).~~
~~The algorithm begins by setting ? = 0 and proceeds in rounds.~~
~~In the t-th round the algorithm picks an example(pi, wi) from S at random uniformly without re placement.~~
~~Denote by ?t?1 the value of the weightvector before the t-th round.~~
~~Let w?ti denote the pre dicted word for the i-th example according to ?t?1: w?ti = argmax w?V ?t?1 ? ?(pi, w) + 1wi 6=w. Let ??ti = ?(pi, wi) ? ?(pi, w?~~
~~ti).~~
~~Then the algo rithm updates the weight vector ?t as follows: ?t = ?t?1 + ?ti??~~
~~t i (3) 196 where ?ti = min { 1 ?m , 1wi 6=w?ti ? ?~~
~~???ti ???ti?~~
~~} . In practice we iterate over the m examples in the training set several times; each such iteration is an epoch.~~
~~The final weight vector is set to the average over all weight vectors during training.~~
~~An alternative loss function that is often used to solve structured prediction problems is the log-loss: L(?, pi, wi) = ? logP?(wi|pi) (4) where the probability is defined as P?(wi|pi) = e???(pi,wi) ? w?V e ???(p,w) . Minimization of Eq.~~
~~(2) under the log-loss results ina probabilistic model commonly known as a condi tional random field (CRF) (Lafferty et al, 2001).~~
~~By taking the sub-gradient of Eq.~~
~~(4), we can obtain an update rule similar to the one shown in Eq.~~
~~(3).~~
~~Before defining the feature functions, we define some notation.~~
~~Suppose p ? P? is a sequence of sub-word units.~~
~~We use p1:n to denote the n-gram substring p1 . . .~~
~~pn.~~
~~The two substrings a and b are said to be equal if they have the same length andai = bi for 1 ? i ? n. For a given sub-word unit n gram u ? Pn, we use the shorthand u ? p to mean that we can find u in p; i.e., there exists an index i such that pi:i+n = u. We use |p| to denote the length of the sequence p. We assume we have a pronunciation dictionary,which is a set of words and their baseforms.~~
~~We ac cess the dictionary through the function pron, which takes a word w ? V and returns a set of baseforms.~~
~~4.1 TF-IDF feature functions.~~
~~Term frequency (TF) and inverse document fre quency (IDF) are measures that have been heavily used in information retrieval to search for documents using word queries (Salton et al, 1975).~~
~~Similarly to(Zweig et al, 2010), we adapt TF and IDF by treat ing a sequence of sub-word units as a ?document?and n-gram sub-sequences as ?words.?~~
~~In this anal ogy, we use sub-sequences in surface pronunciations to ?search?~~
~~for baseforms in the dictionary.~~
~~These features measure the frequency of each n-gram inobserved pronunciations of a given word in the training set, along with the discriminative power of the n gram.~~
~~These features are therefore only meaningful for words actually observed in training.~~
~~The term frequency of a sub-word unit n-gram u ? Pn in a sequence p is the length-normalized frequency of the n-gram in the sequence: TFu(p) = 1 |p| ? |u|+ 1 |p|?|u|+1?~~
~~i=1 1u=pi:i+|u|?1 . Next, define the set of words in the training set that contain the n-gram u as Vu = {w ? V | (p, w) ? S, u ? p}.~~
~~The inverse document frequency (IDF) of an n-gram u is defined as IDFu = log |V| |Vu| .IDF represents the discriminative power of an n gram: An n-gram that occurs in few words is better at word discrimination than a very common n-gram.~~
~~Finally, we define word-specific features using TF and IDF.~~
~~Suppose the vocabulary is indexed: V = {w1, . . .~~
~~, wn}.~~
~~Define ew as a binary vector with elements (ew)i = 1wi=w. We define the TF-IDF feature function of u as ?u(p, w) = (TFu(p)?~~
~~IDFu)?~~
~~ew, where ? : Ra?b ? Rc?d ? Rac?bd is the tensor product.~~
~~We therefore have as many TF-IDF feature functions as we have n-grams.~~
~~In practice, we only consider n-grams of a certain order (e.g., bigrams).~~
~~The following toy example demonstrates how the TF-IDF features are computed.~~
~~Suppose we have V = {problem, probably}.~~
~~The dictionary maps?problem?~~
~~to /pcl p r aa bcl b l ax m/ and ?prob ably?~~
~~to /pcl p r aa bcl b l iy/, and our input is(p, w) = ([p r aa b l iy], problem).~~
~~Then for the bi gram /l iy/, we have TF/l iy/(p) = 1/5 (one out of five bigrams in p), and IDF/l iy/ = log(2/1) (one word out of two in the dictionary).~~
~~The indicator vector is eproblem = [ 1 0 ]> , so the final feature is ?/l iy/(p, w) = [1 5 log 2 1 0 ] . 197 4.2 Length feature function.~~
~~The length feature functions measure how the length of a word?s surface form tends to deviate from the baseform.~~
~~These functions are parameterized by a and b and are defined as ?a??`<b(p, w) = 1a??`<b ? ew, where ?` = |p| ? |v|, for some baseform v ?pron(w).~~
~~The parameters a and b can be either posi tive or negative, so the model can learn whether the surface pronunciations of a word tend to be longeror shorter than the baseform.~~
~~Like the TF-IDF features, this feature is only meaningful for words ac tually observed in training.~~
~~As an example, suppose we have V = {problem, probably}, and the word ?probably?~~
~~has two baseforms, /pcl p r aa bcl b l iy/ (of length eight) and /pcl p r aa bcl b ax bcl b l iy/ (of length eleven).~~
~~If we are given an input (p, w) = ([pcl p r aa bcl l ax m], probably), whose length of the surface form is eight, then the length features for the ranges 0 ? ?` < 1 and ?3 ? ?` < ?2 are ?0??`<1(p, w) = [ 0 1 ]> , ??3??`<?2(p, w) = [ 0 1 ]> , respectively.~~
~~Other length features are all zero.~~
~~4.3 Phonetic alignment feature functions.~~
~~Beyond the length, we also measure specific phonetic deviations from the dictionary.~~
~~We define pho netic alignment features that count the (normalized)frequencies of phonetic insertions, phonetic deletions, and substitutions of one surface phone for another baseform phone.~~
~~Given (p, w), we use dy namic programming to align the surface form p with all of the baseforms of w. Following (Riley et al, 1999), we encode a phoneme/phone with a 4-tuple: consonant manner, consonant place, vowel manner, and vowel place.~~
~~Let the dash symbol ???~~
~~be agap in the alignment (corresponding to an inser tion/deletion).~~
~~Given p, q ? P ? {?}, we say that a pair (p, q) is a deletion if p ? P and q = ?, isan insertion if p = ? and q ? P , and is a substi tution if both p, q ? P . Given p, q ? P ? {?}, let(s1, s2, s3, s4) and (t1, t2, t3, t4) be the correspond ing 4-tuple encoding of p and q, respectively.~~
~~The pcl p r aa pcl p er l iy pcl p r aa bcl b ? l iy pcl p r aa pcl p er ? ?~~
~~l iy pcl p r aa bcl b ax bcl b l iy Table 1: Possible alignments of [p r aa pcl p er l iy] with two baseforms of ?probably?~~
~~in the dictionary.~~
~~similarity between p and q is defined as s(p, q) = { 1, if p = ? or q = ?; ?4 i=1 1si=ti , otherwise.~~
~~Consider aligning p with the Kw = |pron(w)|baseforms of w. Define the length of the align ment with the k-th baseform as Lk, for 1 ? k ? Kw.~~
~~The resulting alignment is a sequence of pairs (ak,1, bk,1), . . .~~
~~, (ak,Lk , bk,Lk), where ak,i, bk,i ?P ? {?}~~
~~for 1 ? i ? Lk.~~
~~Now we define the align ment features, given p, q ? P ? {?}, as ?p?q(p, w) = 1 Zp Kw?~~
~~k=1 Lk?~~
~~i=1 1ak,i=p, bk,i=q, where the normalization term is Zp = {?Kw k=1 ?Lk i=1 1ak,i=p, if p ? P ; |p| ?Kw if p = ?.~~
~~The normalization for insertions differs from the normalization for substitutions and deletions, so that the resulting values always lie between zero and one.~~
~~As an example, consider the input pair (p, w) = ([p r aa pcl p er l iy], probably) and suppose there are two baseforms of the word ?probably?~~
~~in the dictionary.~~
~~Let one possible alignments be the one shown in Table 1.~~
~~Since /p/ occurs four times in the alignments and two of them are aligned to [b], the feature for p?~~
~~b is then ?p?b(p, w) = 2/4.~~
~~Unlike the TF-IDF feature functions and thelength feature functions, the alignment feature func tions can assign a non-zero score to words that are not seen at training time (but are in the dictionary),as long as there is a good alignment with their baseforms.~~
~~The weights given to the alignment fea tures are the analogue of substitution, insertion, and deletion rule probabilities in traditional phone-based pronunciation models such as (Riley et al, 1999); they can also be seen as a generalized version of the Levenshtein features of (Zweig et al, 2011).~~
~~198 4.4 Dictionary feature function.~~
~~The dictionary feature is an indicator of whether a pronunciation is an exact match to a baseform, which also generalizes to words unseen in training.~~
~~We define the dictionary feature as ?dict(p, w) = 1p?pron(w).~~
~~For example, assume there is a baseform /pcl p r aa bcl b l iy/ for the word ?probably?~~
~~in the dictionary, and p = /pcl p r aa bcl b l iy/.~~
~~Then ?dict(p, probably) = 1, while ?dict(p, problem) = 0.~~
~~4.5 Articulatory feature functions.~~
~~Articulatory models represented as dynamic Bayesian networks (DBNs) have been successful in the past on the lexical access task (Livescu and Glass, 2004; Jyothi et al, 2011).~~
~~In such models, pronunciation variation is seen as the result of asynchrony between the articulators (lips, tongue, etc.) and deviations from the intended articulatory positions.~~
~~Given a sequence p and a word w, we use the DBN to produce an alignment at the articulatory level, which is a sequence of 7-tuples, representing the articulatory variables3 lip opening, tongue tip location and opening, tongue body location and opening, velum opening, and glottis opening.~~
~~We extract three kinds of features from the output?substitutions, asynchrony, and log-likelihood.The substitution features are similar to the pho netic alignment features in Section 4.3, except thatthe alignment is not a sequence of pairs but a se quence of 14-tuples (7 for the baseform and 7 for thesurface form).~~
~~The DBN model is based on articu latory phonology (Browman and Goldstein, 1992), in which there are no insertions and deletions, only substitutions (apparent insertions and deletions areaccounted for by articulatory asynchrony).~~
~~Formally, consider the seven sets of articulatory vari able values F1, . . .~~
~~, F7.~~
~~For example, F1 could beall of the values of lip opening, F1 ={closed, critical, narrow, wide}.~~
~~Let F = {F1, . . .~~
~~, F7}.~~
~~Con sider an articulatory variable F ? F . Suppose the alignment for F is (a1, b1), . . .~~
~~, (aL, bL), where L 3We use the term ?articulatory variable?~~
~~for the ?articulatory features?~~
~~of (Livescu and Glass, 2004; Jyothi et al, 2011), in order to avoid confusion with our feature functions.~~
~~is the length of the alignment and ai, bi ? F , for 1 ? i ? L. Here the ai are the intended articulatory variable values according to the baseform, and the bi are the corresponding realized values.~~
~~For each a, b ? F we define a substitution feature function: ?a?b(p, w) = 1 L L?~~
~~i=1 1ai=a, bi=b. The asynchrony features are also extracted from the DBN alignments.~~
~~Articulators are not always synchronized, which is one cause of pronunciation variation.~~
~~We measure this by looking at the phones that two articulators are aiming to produce, and find the time difference between them.~~
~~Formally, we consider two articulatory variables Fh, Fk ? F . Let the alignment between the two variables be (a1, b1), . . .~~
~~, (aL, bL), where now ai ? Fh and bi ?Fk.~~
~~Each ai and bi can be mapped back to the cor responding phone index th,i and tk,i, for 1 ? i ? L. The average degree of asynchrony is then defined as async(Fh, Fk) = 1 L L?~~
~~i=1 (th,i ? tk,i) . More generally, we compute the average asynchrony between any two sets of variables F1,F2 ? F as async(F1,F2) = 1 L L?~~
~~i=1 ? ?~~
~~1 |F1| ? Fh?F1 th,i ? 1 |F2| ? Fk?F2 tk,i ? ?~~
~~We then define the asynchrony features as ?a?async(F1,F2)?b = 1a?async(F1,F2)?b. Finally, the log-likelihood feature is the DBN alignment score, shifted and scaled so that the value lies between zero and one, ?dbn-LL(p, w) = L(p, w)?~~
~~h c , where L is the log-likelihood function of the DBN, h is the shift, and c is the scale.Note that none of the DBN features are wordspecific, so that they generalize to words in the dic tionary that are unseen in the training set.~~
~~All experiments are conducted on a subset of the Switchboard conversational speech corpus that has 199 been labeled at a fine phonetic level (Greenberg et al., 1996); these phonetic transcriptions are the input to our lexical access models.~~
~~The data subset, phoneset P , and dictionary are the same as ones previ ously used in (Livescu and Glass, 2004; Jyothi et al,2011).~~
~~The dictionary contains 3328 words, consist ing of the 5000 most frequent words in Switchboard, excluding ones with fewer than four phones in their baseforms.~~
~~The baseforms use a similar, slightly smaller phone set (lacking, e.g., nasalization).~~
~~Wemeasure performance by error rate (ER), the propor tion of test examples predicted incorrectly.~~
~~The TF-IDF features used in the experimentsare based on phone bigrams.~~
~~For all of the ar ticulatory DBN features, we use the DBN from (Livescu, 2005) (the one in (Jyothi et al, 2011)is more sophisticated and may be used in future work).~~
~~For the asynchrony features, the ar ticulatory pairs are (F1,F2) ? {({tongue tip}, {tongue body}), ({lip opening}, {tongue tip, tongue body}), and ({lip opening, tongue tip, tongue body}, {glottis, velum})}, as in (Livescu, 2005).~~
~~The parameters (a, b) of the length and asynchrony features are drawn from (a, b) ? {(?3,?2), (?2,?1), . . .~~
~~(2, 3)}.~~
~~We compare the CRF4, Passive-Aggressive (PA), and Pegasos learning algorithms.~~
~~The regularization parameter ? is tuned on the development set.~~
~~We run all three algorithms for multiple epochs and pick the best epoch based on development set performance.~~
~~For the first set of experiments, we use the same division of the corpus as in (Livescu and Glass,2004; Jyothi et al, 2011) into a 2492-word training set, a 165-word development set, and a 236 word test set.~~
~~To give a sense of the difficulty ofthe task, we test two simple baselines.~~
~~One is a lexicon lookup: If the surface form is found in the dic tionary, predict the corresponding word; otherwise,guess randomly.~~
~~For a second baseline, we calcu late the Levenshtein (0-1 edit) distance between the input pronunciation and each dictionary baseform, and predict the word corresponding to the baseform closest to the input.~~
~~The results are shown in the first two rows of Table 2.~~
~~We can see that, by adding justthe Levenshtein distance, the error rate drops signif4We use the term ?CRF?~~
~~since the learning algorithm corresponds to CRF learning, although the task is multiclass classifi cation rather than a sequence or structure prediction task.~~
~~Model ER lexicon lookup (from (Livescu, 2005)) 59.3% lexicon + Levenshtein distance 41.8% (Jyothi et al, 2011) 29.1% CRF/DP+ 21.5% PA/DP+ 15.2% Pegasos/DP+ 14.8% PA/ALL 15.2% Table 2: Lexical access error rates (ER) on the same data split as in (Livescu and Glass, 2004; Jyothi et al, 2011).~~
~~Models labeled X/Y use learning algorithm X and featureset Y. The feature set DP+ contains TF-IDF, DP alignment, dictionary, and length features.~~
~~The set ALL con tains DP+ and the articulatory DBN features.~~
~~The bestresults are in bold; the differences among them are in significant (according to McNemar?s test with p = .05).~~
~~icantly.~~
~~However, both baselines do quite poorly.~~
~~Table 2 shows the best previous result on this data set from the articulatory model of Jyothi et al, which greatly improves over our baselines as well as over a much more complex phone-based model (Jyothi et al, 2011).~~
~~The remaining rows of Table 2 giveresults with our feature functions and various learn ing algorithms.~~
~~The best result for PA/DP+ (the PAalgorithm using all features besides the DBN fea tures) on the development set is with ? = 100 and 5 epochs.~~
~~Tested on the test set, this model improves over (Jyothi et al, 2011) by 13.9% absolute (47.8% relative).~~
~~The best result for Pegasos with the same features on the development set is with ? = 0.01 and 10 epochs.~~
~~On the test set, this model gives a 14.3%absolute improvement (49.1% relative).~~
~~CRF learn ing with the same features performs about 6% worse than the corresponding PA and Pegasos models.~~
~~The single-threaded running time for PA/DP+ andPegasos/DP+ is about 40 minutes per epoch, mea sured on a dual-core AMD 2.4GHz CPU with 8GB of memory; for CRF, it takes about 100 minutes for each epoch, which is almost entirely because the weight vector ? is less sparse with CRF learning.~~
~~In the PA and Pegasos algorithms, we only update ?for the most confusable word, while in CRF learn ing, we sum over all words.~~
~~In our case, the number of non-zero entries in ? for PA and Pegasos is around 800,000; for CRF, it is over 4,000,000.~~
~~Though PA and Pegasos take roughly the same amount of time per epoch, Pegasos tends to require more epochs to 200Figure 1: 5-fold cross validation (CV) results.~~
~~The lexicon lookup baseline is labeled lex; lex + lev = lexicon lookup with Levenshtein distance.~~
~~Each point cor responds to the test set error rate for one of the 5 datasplits.~~
~~The horizontal red line marks the mean of the results with means labeled, and the vertical red line indi cates the mean plus and minus one standard deviation.~~
~~achieve the same performance as PA. For the second experiment, we perform 5-foldcross-validation.~~
~~We combine the training, devel opment, and test sets from the previous experiment, and divide the data into five folds.~~
~~We take three folds for training, one fold for tuning ? and the bestepoch, and the remaining fold for testing.~~
~~The re sults on the test fold are shown in Figure 1, which compares the learning algorithms, and Figure 2, which compares feature sets.~~
~~Overall, the resultsare consistent with our first experiment.~~
~~The fea ture selection experiments in Figure 2 shows that the TF-IDF features alone are quite weak, while the dynamic programming alignment features alone are quite good.~~
~~Combining the two gives close to our best result.~~
~~Although the marginal improvement getssmaller as we add more features, in general perfor mance keeps improving the more features we add.~~
~~The results in Section 5 are the best obtained thus far on the lexical access task on this conversationaldata set.~~
~~Large-margin learning, using the Passive Aggressive and Pegasos algorithms, has benefits over CRF learning for our task: It produces sparser models, is faster, and produces better lexical access results.~~
~~In addition, the PA algorithm is faster than Pegasos on our task, as it requires fewer epochs.~~
~~Our ultimate goal is to incorporate such models into complete speech recognizers, that is to predict word sequences from acoustics.~~
~~This requires (1)Figure 2: Feature selection results for five-fold cross val idation.~~
~~In the figure, phone bigram TF-IDF is labeledp2; phonetic alignment with dynamic programming is la beled DP.~~
~~The dots and lines are as defined in Figure 1.~~
~~extension of the model and learning algorithm toword sequences and (2) feature functions that re late acoustic measurements to sub-word units.~~
~~The extension to sequences can be done analogously to segmental conditional random fields (SCRFs).~~
~~The main difference between SCRFs and our approach would be the large-margin learning, which can bestraightforwardly applied to sequences.~~
~~To incorpo rate acoustics, we can use feature functions based on classifiers of sub-word units, similarly to previouswork on CRF-based speech recognition (Gunawar dana et al, 2005; Morris and Fosler-Lussier, 2008; Prabhavalkar et al, 2011).~~
~~Richer, longer-span (e.g., word-level) feature functions are also possible.Thus far we have restricted the pronunciation-toword score to linear combinations of feature functions.~~
~~This can be extended to non-linear combi nations using a kernel.~~
~~This may be challenging in a high-dimensional feature space.~~
~~One possibility is to approximate the kernels as in (Keshet et al, 2011).~~
~~Additional extensions include new featurefunctions, such as context-sensitive alignment features, and joint inference and learning of the align ment models embedded in the feature functions.~~
~~AcknowledgmentsWe thank Raman Arora, Arild N?ss, and the anonymous reviewers for helpful suggestions.~~
~~This research was supported in part by NSF grant IIS 0905633.~~
~~The opinions expressed in this work are those of the authors and do not necessarily reflect the views of the funding agency.~~
~~201~~