~~Starting from first principles, we revisit the statistical approach and study two forms of the Bayes decision rule: the common rule for minimizing the number of string errors and a novel rule for minimizing the number of symbol errors.~~
~~The Bayes decision rule for minimizing the number of string errors is widely used, e.g. in speech recognition, POS tagging and machine translation, but its justification is rarely questioned.~~
~~To minimize the number of symbol errors, which is the more suitable criterion for a task like POS tagging, we show that another form of the Bayes decision rule can be derived.~~
~~The major purpose of this paper is to show that the form of the Bayes decision rule should not be taken for granted (as is done in virtually all statistical NLP work), but should be adapted to the error measure being used.~~
~~We present first experimental results for POS tagging tasks.~~
~~By now, the statistical approach to natural language processing (NLP) tasks like speech recognition, POS tagging and machine translation has found widespread use.~~
~~There are three ingredients to any statistical approach to NLP, namely the Bayes decision rule, the probability models (like trigram model, HMM, ...) and the training criterion (like maximum likelihood, mutual information, ...).~~
~~The topic of this paper is to re-consider the form of the Bayes decision rule.~~
~~In virtually all NLP tasks, the specific form of the Bayes decision rule is never questioned, and the decision rule is adapted from speech recognition.~~
~~In speech recognition, the typical decision rule is to maximize the sentence probability over all possible sentences.~~
~~However, this decision rule is optimal for the sentence error rate and not for the word error rate.~~
~~This difference is rarely studied in the literature. As a specific NLP task, we will consider part-of-speech (POS) tagging.~~
~~However, the problem addressed comes up in any NLP task which is tackled by the statistical approach and which makes use of a Bayes decision rule.~~
~~Other prominent examples are speech recognition and machine translation.~~
~~The advantage of the POS tagging task is that it is easier to handle from the mathematical point of view and results in closed-form solutions for the decision rules.~~
~~From this point of view, the POS tagging task serves as a good opportunity to illustrate the key concepts of the statistical approach to NLP.~~
~~Related Work: For the task of POS tagging, statistical approaches were proposed already in the 1960s and 1970s (Stolz et al., 1965; Bahl and Mercer, 1976), before they started to find widespread use in the 1980s (Beale, 1985; DeRose, 1989; Church, 1989).~~
~~To the best of our knowledge, the "standard" version of the Bayes decision rule, which minimizes the number of string errors, is used in virtually all approaches to POS tagging and other NLP tasks.~~
~~There are only two research groups that do not take this type of decision rule for granted: (Merialdo, 1994): In the context of POS tagging, the author introduces a method that he calls maximum likelihood tagging.~~
~~The spirit of this method is similar to that of this work.~~
~~However, this method is mentioned as an aside and its implications for the Bayes decision rule and the statistical approach are not addressed.~~
~~Part of this work goes back to (Bahl et al, 1974) who considered a problem in coding theory.~~
~~(Goel and Byrne, 2003): The error measure considered by the authors is the word error rate in speech recognition, i.e. the edit distance.~~
~~Due to the mathematical complexity of this error measure, the authors resort to numeric approximations to compute the Bayes risk (see next section).~~
~~Since this approach does not result in explicit closed-form equations and involves many numeric approximations, it is not easy to draw conclusions from this work.~~
~~2.1 The Bayes Posterior Risk.~~
~~Knowing that any NLP task is a difficult one, we want to keep the number of wrong decisions as small as possible.~~
~~This point of view has been used for more than 40 years as the starting point for many techniques in pattern classification.~~
~~To classify an observation vector y into one out of several classes c, we resort to the so-called statistical decision theory and try to minimize the average risk or loss in taking a decision.~~
~~The result is known as the Bayes decision rule (Chapter 2 in (Duda and Hart, 1973)): $y \to \hat{c} = \arg\min_{c} \{ \sum_{\tilde{c}} \Pr(\tilde{c}|y) \cdot L[c, \tilde{c}] \}$, where $L[c, \tilde{c}]$ is the so-called loss function or error measure, i.e. the loss we incur in making decision $c$ when the true class is $\tilde{c}$.~~
~~In the following, we will consider two specific forms of the loss function or error measure $L[c, \tilde{c}]$.~~
~~The first will be the measure for string errors, which is the typical loss function used in virtually all statistical approaches.~~
~~The second is the measure for symbol errors, which is the more appropriate measure for POS tagging and also speech recognition with no insertion and deletion errors (such as isolated word recognition).~~
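As a minimal sketch of the decision rule in Section 2.1, the following Python snippet picks the decision with the smallest expected loss for a single observation. The posterior values, the tag names and the two loss functions are made-up illustrations and do not come from the experiments in this paper.

```python
def bayes_decision(posterior, loss):
    """Pick the decision c that minimizes sum_t posterior[t] * loss(c, t)."""
    classes = list(posterior)

    def risk(c):
        # Expected loss of deciding c, averaged over the possible true classes t.
        return sum(posterior[t] * loss(c, t) for t in classes)

    return min(classes, key=risk)


# Hypothetical posterior Pr(c|y) over three tags for one observation y.
posterior = {"NN": 0.5, "VB": 0.3, "JJ": 0.2}


def zero_one(decision, truth):
    """Loss 0 for a correct decision, 1 otherwise."""
    return 0.0 if decision == truth else 1.0


def costly_vb_miss(decision, truth):
    """Like zero_one, but missing a true VB costs 5 (an arbitrary value)."""
    if decision == truth:
        return 0.0
    return 5.0 if truth == "VB" else 1.0


print(bayes_decision(posterior, zero_one))        # NN
print(bayes_decision(posterior, costly_vb_miss))  # VB
```

With the 0/1 loss the rule reduces to picking the most probable class, whereas the second loss moves the decision to a less probable class. This is the sense in which the form of the decision rule depends on the error measure.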
~~2.2 String Error.~~
~~For POS tagging, the starting point is the observed sequence of words $y = w_1^N = w_1 \ldots w_N$, i.e. the sequence of words for which the POS tag sequence $c = g_1^N = g_1 \ldots g_N$ has to be determined.~~
~~The first error measure we consider is the string error: the error is equal to zero only if the POS symbols of the two strings are identical at each position.~~
~~In this case, the loss function is $L[g_1^N, \tilde{g}_1^N] = 1 - \prod_{n=1}^{N} \delta(g_n, \tilde{g}_n)$ with the Kronecker delta $\delta(c, \tilde{c})$.~~
~~In other words, the errors are counted at the string level and not at the level of single symbols.~~
~~Inserting this cost function into the Bayes risk (see Section 2.1), we immediately obtain the following form of the Bayes decision rule for minimum string error: $w_1^N \to \hat{g}_1^N = \arg\max_{g_1^N} \{ \Pr(g_1^N | w_1^N) \} = \arg\max_{g_1^N} \{ \Pr(g_1^N, w_1^N) \}$.~~
~~This is the starting point for virtually all statistical approaches in NLP like speech recognition and machine translation.~~
~~However, this decision rule is only optimal when we consider string errors, e.g. sentence error rate in POS tagging and in speech recognition.~~
~~In practice, however, the empirical errors are counted at the symbol level.~~
~~Apart from (Goel and Byrne, 2003), this inconsistency of decision rule and error measure is never addressed in the literature.~~
~~2.3 Symbol Error.~~
~~Instead of the string error rate, we can also consider the error rate of single POS tag symbols (Bahl et al., 1974; Merialdo, 1994).~~
~~This error measure is defined by the loss function $L[g_1^N, \tilde{g}_1^N] = \sum_{n=1}^{N} [1 - \delta(g_n, \tilde{g}_n)]$.~~
~~This loss function has to be inserted into the Bayes decision rule in Section 2.1.~~
~~The computation of the expected loss, i.e. the averaging over all classes $\tilde{c} = \tilde{g}_1^N$, can be performed in a closed form.~~
~~We omit the details of the straightforward calculations and state only the result.~~
~~It turns out that we will need the marginal (and posterior) probability distribution $\Pr_m(g | w_1^N)$ at positions $m = 1, \ldots, N$: $\Pr_m(g | w_1^N) := \sum_{g_1^N : g_m = g} \Pr(g_1^N | w_1^N)$, where the sum is carried out over all POS tag strings $g_1^N$ with $g_m = g$, i.e. the tag $g_m$ at position $m$ is fixed at $g_m = g$.~~
~~The question of how to perform this summation efficiently will be considered later after we have introduced the model distributions.~~
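One way to fill in the omitted calculation is sketched below. It is a reconstruction in the notation of this section, not the authors' own derivation, and it uses only the fact that the posterior sums to one over all tag strings.

```latex
\begin{align*}
\sum_{\tilde{g}_1^N} \Pr(\tilde{g}_1^N | w_1^N) \cdot L[g_1^N, \tilde{g}_1^N]
  &= \sum_{\tilde{g}_1^N} \Pr(\tilde{g}_1^N | w_1^N) \sum_{n=1}^{N} \bigl[ 1 - \delta(g_n, \tilde{g}_n) \bigr] \\
  &= \sum_{n=1}^{N} \Bigl[ 1 - \sum_{\tilde{g}_1^N : \tilde{g}_n = g_n} \Pr(\tilde{g}_1^N | w_1^N) \Bigr] \\
  &= \sum_{n=1}^{N} \bigl[ 1 - \Pr\nolimits_n(g_n | w_1^N) \bigr]
\end{align*}
```

Each term of the final sum depends on the tag at a single position only, so the expected loss is minimized by choosing, independently at each position, the tag with the largest marginal posterior probability.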
~~Thus we have obtained the Bayes decision rule for minimum symbol error at position $m = 1, \ldots, N$: $(w_1^N, m) \to \hat{g}_m = \arg\max_{g} \{ \Pr_m(g | w_1^N) \} = \arg\max_{g} \{ \Pr_m(g, w_1^N) \}$.~~
~~By construction, this decision rule has the special property that it does not put direct emphasis on local coherency of the POS tags produced.~~
~~In other words, this decision rule may produce a POS tag string which is linguistically less likely.~~
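The difference between the two decision rules can be illustrated with a brute-force sketch on a toy posterior over tag strings of length two. The tags `A`, `B`, `C` and all probability values are invented for illustration; the distribution is chosen so that the symbol rule outputs a string to which the posterior assigns no probability mass at all.

```python
from collections import defaultdict

# Hypothetical posterior Pr(g_1^N | w_1^N) over tag strings of length N = 2.
posterior = {("A", "B"): 0.40, ("B", "A"): 0.35, ("B", "C"): 0.25}


def string_rule(posterior):
    """Minimum string error: the single most probable tag string."""
    return max(posterior, key=posterior.get)


def symbol_rule(posterior):
    """Minimum symbol error: the most probable tag at each position."""
    length = len(next(iter(posterior)))
    decided = []
    for m in range(length):
        marginal = defaultdict(float)
        for tags, prob in posterior.items():
            marginal[tags[m]] += prob  # sum over all strings with g_m fixed
        decided.append(max(marginal, key=marginal.get))
    return tuple(decided)


def expected_symbol_errors(decision, posterior):
    """Expected number of wrong tags when `decision` is output."""
    return sum(
        prob * sum(d != t for d, t in zip(decision, tags))
        for tags, prob in posterior.items()
    )


by_string = string_rule(posterior)  # ("A", "B")
by_symbol = symbol_rule(posterior)  # ("B", "B")
print(by_string, expected_symbol_errors(by_string, posterior))
print(by_symbol, expected_symbol_errors(by_symbol, posterior))
```

In this toy case the symbol rule has the lower expected number of symbol errors, although its output string `("B", "B")` never occurs under the posterior. This is an extreme instance of the loss of local coherency described above.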
~~The derivation of the Bayes decision rule assumes that the probability distribution $\Pr(g_1^N, w_1^N)$ (or $\Pr(g_1^N | w_1^N)$) is known.~~
~~Unfortunately, this is not the case in practice.~~
~~Therefore, the usual approach is to approximate the true but unknown distribution by a model distribution $p(g_1^N, w_1^N)$ (or $p(g_1^N | w_1^N)$).~~
~~We will review two popular modelling approaches, namely the generative model and the direct model, and consider the associated Bayes decision rules for both minimum string error and minimum symbol error.~~
~~3.1 Generative Model: Trigram Model.~~
~~We replace the true but unknown joint distribution $\Pr(g_1^N, w_1^N)$ by a model-based probability distribution $p(g_1^N, w_1^N)$: $\Pr(g_1^N, w_1^N) \to p(g_1^N, w_1^N) = p(g_1^N) \cdot p(w_1^N | g_1^N)$.~~
~~We apply the so-called chain rule to factorize each of the distributions $p(g_1^N)$ and $p(w_1^N | g_1^N)$ into a product of conditional probabilities using specific dependence assumptions: $p(g_1^N, w_1^N) = \prod_{n=1}^{N} [ p(g_n | g_{n-2}^{n-1}) \cdot p(w_n | g_n) ]$ with suitable definitions for the case $n = 1$.~~
~~Here, the specific dependence assumptions are that the conditional probabilities can be represented by a POS trigram model $p(g_n | g_{n-2}^{n-1})$ and a word membership model $p(w_n | g_n)$.~~
~~Thus we obtain a probability model whose structure fits into the mathematical framework of the so-called Hidden Markov Model (HMM).~~
~~Therefore, this approach is often also referred to as HMM-based POS tagging.~~
~~However, this terminology is misleading: The POS tag sequence is observable whereas in the Hidden Markov Model the state sequence is always hidden and cannot be observed.~~
~~In the experiments, we will use a 7-gram POS model.~~
~~It is clear how to extend the equations from the trigram case to the 7-gram case.~~
~~3.1.1 String Error.~~
~~Using the above model distribution, we directly obtain the decision rule for minimum string error: $w_1^N \to \hat{g}_1^N = \arg\max_{g_1^N} \{ p(g_1^N, w_1^N) \}$.~~
~~Since the model distribution is basically a second-order model (or trigram model), there is an efficient algorithm for finding the most probable POS tag string.~~
~~This is achieved by a suitable dynamic programming algorithm, which is often referred to as the Viterbi algorithm in the literature.~~
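The dynamic programming search can be sketched as follows. To keep the example short, it assumes a first-order (bigram) tag model rather than the trigram model used above; the trigram case follows the same pattern with tag pairs as states. All names and probability values are hypothetical, and every probability is assumed to be positive (i.e. already smoothed) so that the logarithms exist.

```python
import math


def viterbi(words, tags, init, trans, emit):
    """Most probable tag string under a first-order tag model.

    init[t]     : probability of tag t at the first position
    trans[s][t] : probability of tag t given the previous tag s
    emit[t][w]  : word membership probability of word w given tag t
    """
    # best[t]: log-probability of the best partial tag string ending in t
    best = {t: math.log(init[t]) + math.log(emit[t][words[0]]) for t in tags}
    backpointers = []
    for word in words[1:]:
        new_best, pointer = {}, {}
        for t in tags:
            prev = max(tags, key=lambda s: best[s] + math.log(trans[s][t]))
            new_best[t] = (best[prev] + math.log(trans[prev][t])
                           + math.log(emit[t][word]))
            pointer[t] = prev
        backpointers.append(pointer)
        best = new_best
    # Trace back from the best final tag to recover the whole string.
    path = [max(tags, key=best.get)]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return path[::-1]


# Toy model with made-up parameters.
tags = ["N", "V"]
init = {"N": 0.7, "V": 0.3}
trans = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.8, "V": 0.2}}
emit = {"N": {"fish": 0.6, "swim": 0.4}, "V": {"fish": 0.3, "swim": 0.7}}
words = ["fish", "swim"]

print(viterbi(words, tags, init, trans, emit))  # ['N', 'V']
```

The search cost grows linearly with the sentence length and quadratically with the number of tags, instead of exponentially as for an explicit enumeration of all tag strings.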
~~3.1.2 Symbol Error.~~
~~To apply the Bayes decision rule for minimum symbol error rate, we first compute the marginal probability $p_m(g, w_1^N)$: $p_m(g, w_1^N) = \sum_{g_1^N : g_m = g} p(g_1^N, w_1^N) = \sum_{g_1^N : g_m = g} \prod_{n} [ p(g_n | g_{n-2}^{n-1}) \cdot p(w_n | g_n) ]$.~~
~~Again, since the model is a second-order model, the sum over all possible POS tag strings $g_1^N$ (with $g_m = g$) can be computed efficiently using a suitable extension of the forward-backward algorithm (Bahl et al., 1974).~~
~~Thus we obtain the decision rule for minimum symbol error at positions $m = 1, \ldots, N$: $(w_1^N, m) \to \hat{g}_m = \arg\max_{g} \{ p_m(g, w_1^N) \}$.~~
~~Here, after the marginal probability $p_m(g, w_1^N)$ has been computed, the task of finding the most probable POS tag at position $m$ is computationally easy.~~
~~Instead, the lion's share of the computational effort is required to compute the marginal probability $p_m(g, w_1^N)$.~~
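A forward-backward computation of the marginals can be sketched as below. As a simplifying assumption, the tag model is first-order (bigram) instead of the trigram model of this section, and all parameter values are invented. The function returns the unnormalized marginals $p_m(g, w_1^N)$, which is sufficient for taking the argmax at each position.

```python
def tag_marginals(words, tags, init, trans, emit):
    """Return, for each position m, a dict {g: p_m(g, w_1^N)}.

    Assumes a first-order tag model: init[t], trans[s][t], emit[t][w].
    """
    # Forward pass: fwd[m][t] sums over all tag prefixes ending in t at m.
    fwd = [{t: init[t] * emit[t][words[0]] for t in tags}]
    for word in words[1:]:
        prev = fwd[-1]
        fwd.append({
            t: emit[t][word] * sum(prev[s] * trans[s][t] for s in tags)
            for t in tags
        })
    # Backward pass: bwd[m][s] sums over all tag suffixes following s at m.
    bwd = [{t: 1.0 for t in tags}]
    for word in reversed(words[1:]):
        nxt = bwd[0]
        bwd.insert(0, {
            s: sum(trans[s][t] * emit[t][word] * nxt[t] for t in tags)
            for s in tags
        })
    return [{t: fwd[m][t] * bwd[m][t] for t in tags}
            for m in range(len(words))]


# Toy model with made-up parameters.
tags = ["N", "V"]
init = {"N": 0.7, "V": 0.3}
trans = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.8, "V": 0.2}}
emit = {"N": {"fish": 0.6, "swim": 0.4}, "V": {"fish": 0.3, "swim": 0.7}}
words = ["fish", "swim"]

marginals = tag_marginals(words, tags, init, trans, emit)
symbol_decisions = [max(m, key=m.get) for m in marginals]
print(symbol_decisions)  # ['N', 'V']
```

Once the marginals are available, the decision at each position is a plain argmax over the tag set, which reflects the remark above that most of the effort goes into computing the marginals.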
~~3.2 Direct Model: Maximum Entropy.~~
~~We replace the true but unknown posterior distribution $\Pr(g_1^N | w_1^N)$ by a model-based probability distribution $p(g_1^N | w_1^N)$, i.e. $\Pr(g_1^N | w_1^N) \to p(g_1^N | w_1^N)$, and apply the chain rule: $p(g_1^N | w_1^N) = \prod_{n=1}^{N} p(g_n | g_1^{n-1}, w_1^N) = \prod_{n=1}^{N} p(g_n | g_{n-2}^{n-1}, w_{n-2}^{n+2})$.~~
~~As for the generative model, we have made specific assumptions: there is a second-order dependence for the tags $g_1^n$, and the dependence on the words $w_1^N$ is limited to a window $w_{n-2}^{n+2}$ around position $n$.~~
~~The resulting model is still rather complex and requires further specifications.~~
~~The typical procedure is to resort to log-linear modelling, which is also referred to as maximum entropy modelling (Ratnaparkhi, 1996; Berger et al, 1996).~~
~~3.2.1 String Error.~~
~~For the minimum string error, we obtain the decision rule: $w_1^N \to \hat{g}_1^N = \arg\max_{g_1^N} \{ p(g_1^N | w_1^N) \}$.~~
~~Since this is still a second-order model, we can use dynamic programming to compute the most likely POS string.~~
~~3.2.2 Symbol Error.~~
~~For the minimum symbol error, the marginal (and posterior) probability $p_m(g | w_1^N)$ has to be computed: $p_m(g | w_1^N) = \sum_{g_1^N : g_m = g} p(g_1^N | w_1^N) = \sum_{g_1^N : g_m = g} \prod_{n} p(g_n | g_{n-2}^{n-1}, w_{n-2}^{n+2})$, which, due to the specific structure of the model $p(g_n | g_{n-2}^{n-1}, w_{n-2}^{n+2})$, can be calculated efficiently using only a forward algorithm (without a "backward" part).~~
~~Thus we obtain the decision rule for minimum symbol error at positions $m = 1, \ldots, N$: $(w_1^N, m) \to \hat{g}_m = \arg\max_{g} \{ p_m(g | w_1^N) \}$.~~
~~As in the case of the generative model, the computational effort lies in computing the posterior probability $p_m(g | w_1^N)$ rather than in finding the most probable tag at position $m$.~~
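The reason why no backward part is needed can be made concrete with a small sketch: each local distribution is normalized over the current tag, so the sum over all tags after position $m$ equals one and only the positions up to $m$ contribute. The example assumes a first-order tag dependence instead of the second-order one used above, and `toy_local_model` is a hand-set stand-in for a trained maximum entropy model, not an actual tagger.

```python
def direct_marginals(words, tags, local_model, start="<s>"):
    """Posterior marginals p_m(g | w_1^N) for a locally normalized model.

    local_model(prev_tag, n, words) must return a dict over `tags`
    that sums to one (the distribution of the tag at position n).
    """
    alpha = dict(local_model(start, 0, words))
    marginals = [alpha]
    for n in range(1, len(words)):
        new_alpha = {t: 0.0 for t in tags}
        for prev_tag, mass in alpha.items():
            dist = local_model(prev_tag, n, words)
            for t in tags:
                new_alpha[t] += mass * dist[t]  # forward recursion only
        alpha = new_alpha
        marginals.append(alpha)
    return marginals


def toy_local_model(prev_tag, n, words):
    """Hypothetical local distribution with arbitrary hand-set values."""
    if words[n] == "fish":
        p_noun = 0.4 if prev_tag == "N" else 0.7
    else:
        p_noun = 0.2 if prev_tag == "N" else 0.5
    return {"N": p_noun, "V": 1.0 - p_noun}


tags = ["N", "V"]
words = ["fish", "swim"]
marginals = direct_marginals(words, tags, toy_local_model)
print([max(m, key=m.get) for m in marginals])  # ['N', 'V']
```

Because every local distribution sums to one, the forward quantities are already normalized posteriors at each position, in contrast to the generative model, where the words to the right of position $m$ enter through the backward part.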
~~So far, we have said nothing about how we train the free parameters of the model distributions.~~
~~We use fairly conventional training procedures that we mention only for the sake of completeness.~~
~~4.1 Generative Model.~~
~~We consider the trigram-based model.~~
~~The free parameters here are the entries of the POS trigram distribution $p(g | g'', g')$ and of the word membership distribution $p(w | g)$.~~
~~These unknown parameters are computed from a labelled training corpus, i.e. a collection of sentences where for each word the associated POS tag is given.~~
~~In principle, the free parameters of the models are estimated as relative frequencies.~~
~~For the test data, we have to allow for both POS trigrams (or n-grams) and (single) words that were not seen in the training data.~~
~~This problem is tackled by applying smoothing methods that were originally designed for language modelling in speech recognition (Ney et al, 1997).~~
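The relative-frequency estimates can be sketched as follows. The smoothing step is deliberately left out, so unseen trigrams and words simply have no entry; the sentence-start padding symbol `<s>` and the three-word toy corpus are assumptions made for this illustration.

```python
from collections import Counter


def relative_frequencies(tagged_sentences, pad="<s>"):
    """Unsmoothed estimates of p(g | g'', g') and p(w | g).

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    Returns two dicts keyed by (g'', g', g) and by (g, w), respectively.
    """
    trigram_counts, history_counts = Counter(), Counter()
    pair_counts, tag_counts = Counter(), Counter()
    for sentence in tagged_sentences:
        history = (pad, pad)  # definition for the first positions
        for word, tag in sentence:
            trigram_counts[history + (tag,)] += 1
            history_counts[history] += 1
            pair_counts[(tag, word)] += 1
            tag_counts[tag] += 1
            history = (history[1], tag)
    p_tag = {k: c / history_counts[k[:2]] for k, c in trigram_counts.items()}
    p_word = {k: c / tag_counts[k[0]] for k, c in pair_counts.items()}
    return p_tag, p_word


# Tiny made-up labelled corpus.
corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
p_tag, p_word = relative_frequencies(corpus)
print(p_tag[("DT", "NN", "VBZ")])  # 1.0
print(p_word[("NN", "dog")])       # 0.5
```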
~~4.2 Direct Model.~~
~~For the maximum entropy model, the free parameters are the so-called $\lambda_i$ or feature parameters (Berger et al., 1996; Ratnaparkhi, 1996).~~
~~The training criterion is to optimize the logarithm of the model probabilities $p(g_n | g_{n-2}^{n-1}, w_{n-2}^{n+2})$ over all positions $n$ in the training corpus.~~
~~The corresponding algorithm is referred to as the GIS algorithm (Berger et al., 1996).~~
~~As usual with maximum entropy models, the problem of smoothing does not seem to be critical and is not addressed explicitly.~~
~~Of course, there have already been many papers about POS tagging using statistical methods.~~
~~The goal of the experiments is to compare the two decision rules and to analyze the differences in performance.~~
~~As the results for the WSJ corpus will show, both the trigram method and the maximum entropy method have a tagging error rate of 3.0% to 3.5% and are thus comparable to the best results reported in the literature, e.g. (Ratnaparkhi, 1996).~~
~~5.1 Task and Corpus.~~
~~The experiments are performed on the Wall Street Journal (WSJ) English corpus and on the Münster Tagging Project (MTP) German corpus.~~
~~The POS tagging part of the WSJ corpus (Table 1) was compiled by the University of Pennsylvania and consists of about one million English words with manually annotated POS tags.~~

| | | Text | POS |
|---|---|---|---|
| Train | Sentences | 43508 | |
| | Words+PMs | 1061772 | |
| | Singletons | 21522 | 0 |
| | Word Vocabulary | 46806 | 45 |
| | PM Vocabulary | 25 | 9 |
| Test | Sentences | 4478 | |
| | Words+PMs | 111220 | |
| | OOVs | 2879 | 0 |

~~Table 1: WSJ corpus statistics.~~

~~The MTP corpus (Table 2) was compiled at the University of Münster and contains tagged German words from articles of the newspapers Die Zeit and Frankfurter Allgemeine Zeitung (Kinscher and Steiner, 1995).~~
~~For the corpus statistics, it is helpful to distinguish between the true words and the punctuation marks (see Table 1 and Table 2).~~
~~This distinction is made for both the text and the POS corpus.~~
~~In addition, the tables show the vocabulary size (number of different tokens) for the words and for the punctuation marks.~~
~~Punctuation marks (PMs) are all tokens which do not contain letters or digits.~~
~~The total number of running tokens is indicated as Words+PMs.~~
~~Singletons are the tokens which occur only once in the training data.~~
~~Out-of-Vocabulary words (OOVs) are the words in the test data that did not occur in the training corpus.~~

| | | Text | POS |
|---|---|---|---|
| Train | Sentences | 19845 | |
| | Words+PMs | 349699 | |
| | Singletons | 32678 | 11 |
| | Word Vocabulary | 51491 | 68 |
| | PM Vocabulary | 27 | 5 |
| Test | Sentences | 2206 | |
| | Words+PMs | 39052 | |
| | OOVs | 3584 | 2 |

~~Table 2: MTP corpus statistics.~~
~~5.2 POS Tagging Results.~~
~~The tagging experiments were performed for both types of models, each of them with both types of the decision rules.~~
~~The generative model is based on the approach described in (Sündermann and Ney, 2003).~~
~~Here the optimal value of the n-gram order is determined from the corpus statistics and has a maximum of n = 7.~~
~~The experiments for the direct model were performed using the maximum entropy tagger described in (Ratnaparkhi, 1996).~~
~~The tagging error rates are shown in Table 3 and Table 4.~~
~~In addition to the overall tagging error rate (Overall), the tables show the tagging error rates for the Out-of-Vocabulary words (OOVs) and for the punctuation marks (PMs).~~
~~For the generative model, both decision rules yield similar results.~~
~~For the direct model, the overall tagging error rate increases on each of the two tasks (from 3.0 % to 3.3 % on WSJ and from 5.4 % to 5.6 % on MTP) when we use the symbol decision rule instead of the string decision rule.~~
~~In particular, for OOVs, the error rate goes up clearly.~~
~~Right now, we do not have a clear explanation for this difference between the generative model and the direct model.~~
~~It might be related to the "forward" structure of the direct model as opposed to the "forward-backward" structure of the generative model.~~
~~In any case, the refined bootstrap method (Bisani and Ney, 2004) has shown that the differences in the overall tagging error rate are not statistically significant.~~
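To give an idea of how such a comparison can be carried out, the sketch below implements a generic paired bootstrap over test sentences. It is not the refined method of (Bisani and Ney, 2004), whose details are not reproduced here; the function name, the number of resamples and the per-sentence error counts are all hypothetical.

```python
import random


def prob_of_improvement(errors_a, errors_b, n_resamples=1000, seed=0):
    """Fraction of bootstrap resamples in which system A makes fewer errors.

    errors_a[i], errors_b[i]: number of tagging errors of the two systems
    on the same test sentence i (paired data).
    """
    rng = random.Random(seed)
    n = len(errors_a)
    wins = 0
    for _ in range(n_resamples):
        # Draw n sentences with replacement; both systems use the same draw.
        sample = [rng.randrange(n) for _ in range(n)]
        if sum(errors_a[i] for i in sample) < sum(errors_b[i] for i in sample):
            wins += 1
    return wins / n_resamples


# Made-up per-sentence error counts for two decision rules.
string_rule_errors = [1, 0, 2, 1, 0, 3, 0, 1]
symbol_rule_errors = [1, 1, 2, 0, 0, 3, 1, 1]
print(prob_of_improvement(string_rule_errors, symbol_rule_errors))
```

A value close to 0.5 would suggest that neither rule is reliably better on the test set, whereas values close to 0 or 1 would point to a systematic difference.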
~~5.3 Examples.~~
~~A detailed analysis of the tagging results showed that for both models there are sentences where one decision rule performs better and sentences where the other decision rule performs better.~~
~~For the generative model, these differences seem to occur at random, but for the direct model, some distinct tendencies can be observed.~~
| WSJ Task | Decision Rule | Overall | OOVs | PMs |
|---|---|---|---|---|
| Generative Model | string | 3.5 | 16.9 | 0 |
| | symbol | 3.5 | 16.7 | 0 |
| Direct Model | string | 3.0 | 15.4 | 0.08 |
| | symbol | 3.3 | 16.6 | 0.1 |

~~Table 3: POS tagging error rates [%] for WSJ task.~~

| MTP Task | Decision Rule | Overall | OOVs | PMs |
|---|---|---|---|---|
| Generative Model | string | 5.4 | 13.4 | 3.6 |
| | symbol | 5.4 | 13.4 | 3.6 |
| Direct Model | string | 5.4 | 12.7 | 3.8 |
| | symbol | 5.6 | 13.4 | 3.7 |

~~Table 4: POS tagging error rates [%] for MTP task.~~

~~For example, for the WSJ corpus, the string decision rule is significantly better for the present and past tense of verbs (VBP, VBD), and the symbol decision rule is better for adverbs (RB) and verb past participles (VBN).~~
~~Typical errors generated by the symbol decision rule are tagging present tense as infinitive (VB) and past tense as past participle (VBN), whereas with the string decision rule, adverbs are often tagged as preposition (IN) or adjective (JJ) and past participles as past tense (VBD).~~
~~For the German corpus, the string decision rule better handles demonstrative determiners (Rr) and subordinate conjunctions (Cs), whereas the symbol decision rule is better for definite articles (Db).~~
~~The symbol decision rule typically tags demonstrative determiners as definite articles (Db) and subordinate conjunctions as interrogative adverbs (Bi), and the string decision rule tends to assign the demonstrative determiner tag to definite articles.~~
~~These typical errors for the symbol decision rule are shown in Table 5, and for the string decision rule in Table 6.~~
~~So far, the experimental tests have shown no improvement when we use the Bayes decision rule for minimizing the number of symbol errors rather than the number of string errors.~~
~~However, the important result is that the new approach results in comparable performance.~~
~~More work is needed to contrast the two approaches.~~
~~The main purpose of this paper has been to show that, in addition to the widely used decision rule for minimizing the string errors, it is possible to derive a decision rule for minimizing the number of symbol errors and to build up the associated mathematical framework.~~
~~There are a number of open questions for future work: 1) The error rates for the two decision rules are comparable.~~
~~Is that an experimental coincidence?~~
~~Are there situations for which we must expect a significant difference between the two decision rules?~~
~~We speculate that the two decision rules could always have similar performance if the error rates are small.~~
~~2) Ideally, the training criterion should be closely related to the error measure used in the decision rule.~~
~~Right now, we have used the training criteria that had been developed in the past and that had been (more or less) designed for the string error rate as error measure.~~
~~Can we come up with a training criterion tailored to the symbol error rate?~~
~~3) In speech recognition and machine translation, more complicated error measures such as the edit distance and the BLEU measure are used.~~
~~Is it possible to derive closed-form Bayes decision rules (or suitable analytic approximations) for these error measures?~~
~~What are the implications?~~