One may need to build a statistical parser for a new language, using only a very small labeled treebank together with raw text. We argue that bootstrapping a parser is most promising when the model uses a rich set of redundant features, as in recent models for scoring dependency parses (McDonald et al., 2005). Drawing on Abney's (2004) analysis of the Yarowsky algorithm, we perform bootstrapping by entropy regularization: we maximize a linear combination of conditional likelihood on labeled data and confidence (negative Rényi entropy) on unlabeled data. In initial experiments, this surpassed EM for training a simple feature-poor generative model, and also improved the performance of a feature-rich, conditionally estimated model where EM could not easily have been applied. For our models and training sets, more peaked measures of confidence, measured by Rényi entropy, outperformed smoother ones. We discuss how our feature set could be extended with cross-lingual or cross-domain features, to incorporate knowledge from parallel or comparable corpora during bootstrapping.
In this paper, we address the problem of bootstrapping new statistical parsers for new languages, genres, or domains. Why is this problem important? Many applications of multilingual NLP require parsing in order to extract information, opinions, and answers from text, and to produce improved translations. Yet an adequate labeled training corpus (a large treebank of manually constructed parse trees of typical sentences) is rarely available and would be prohibitively expensive to develop. We show how it is possible to train instead from a small hand-labeled treebank in the target domain, together with a large unannotated collection of in-domain sentences. Additional resources such as parsers for other domains or languages can be integrated naturally.

Dependency parsing is important as a key component in leading systems for information extraction (Weischedel, 2004)[1] and question answering (Peng et al., 2005). These systems rely on edges or paths in dependency parse trees to define their extraction patterns and classification features. Parsing is also key to the latest advances in machine translation, which translate syntactic phrases (Galley et al., 2006; Marcu et al., 2006; Cowan et al., 2006).
Our approach rests on three observations:

- Recent "feature-based" parsing models are an excellent fit for bootstrapping, because the parse is often overdetermined by many redundant features.
- The feature-based framework is flexible enough to incorporate other sources of guidance during training or testing, such as the knowledge contained in a parser for another language or domain.
- Maximizing a combination of likelihood on labeled data and confidence on unlabeled data is a principled approach to bootstrapping.
2.1 Feature-Based Parsing.

McDonald et al. (2005) introduced a simple, flexible framework for scoring dependency parses. Each directed edge e in the dependency tree is described with a high-dimensional feature vector f(e). The edge's score is the dot product f(e) · θ, where θ is a learned weight vector. The overall score of a dependency tree is the sum of the scores of all edges in the tree.

[1] Ralph Weischedel (p.c.) reports that this system's performance degrades considerably when only phrase chunking is available rather than full parsing.

Given an n-word input sentence, the parser begins by scoring each of the O(n^2) possible edges, and then seeks the highest-scoring legal dependency tree formed by any n−1 of these edges, using an O(n^3) dynamic programming algorithm (Eisner, 1996) for projective trees. For non-projective parsing, O(n^3), or with some trickery O(n^2), greedy algorithms exist (Chu and Liu, 1965; Edmonds, 1967; Gabow et al., 1986).
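As a concrete illustration, the edge-factored scoring scheme can be sketched as follows. The feature names and weights here are invented for illustration; McDonald et al.'s actual feature templates are far richer.

```python
def edge_score(features, theta):
    """Score of one directed edge: the dot product f(e) . theta, with the
    sparse feature vector f(e) given as a list of firing feature names."""
    return sum(theta.get(f, 0.0) for f in features)

def tree_score(tree_edges, theta):
    """Overall score of a dependency tree: the sum of its edge scores."""
    return sum(edge_score(feats, theta) for feats in tree_edges)

# Hypothetical learned weight vector and a two-edge candidate tree.
theta = {"head=V,child=N": 1.2, "dist=1": 0.3, "head=N,child=Det": 0.9}
tree = [["head=V,child=N", "dist=1"], ["head=N,child=Det"]]
total = tree_score(tree, theta)  # 1.2 + 0.3 + 0.9
```

A real parser would score all O(n^2) candidate edges this way and then search for the best tree with the Eisner or maximum-spanning-tree algorithm.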
The feature function f may pay attention to many properties of the directed edge e. Of course, features may consider the parent and child words connected by e, and their parts of speech.[2] But some features used by McDonald et al. (2005) also consider the parts of speech of words adjacent to the parent and child, or between the parent and child, as well as the number of words between the parent and child. In general, these features are not available in a generative model such as a PCFG. Although feature-based models are often trained purely discriminatively, we will see in §2.6 how to train them to model conditional probabilities.
2.2 Feature-Based Parsing and Bootstrapping.

The above parsing model is robust, thanks to its many features. On the Penn Treebank WSJ sections 02-21, for example, McDonald's parser extracts 5.5 million feature types from supervised edges alone, with about 120 feature tokens firing per edge. The highest-scoring parse tree represents a consensus among all features on all prospective edges. Even if a prospective edge has some discouraging features (i.e., with negative or zero weights), it may still have a relatively high score thanks to its other features. Furthermore, even if the edge has a low total score, it may still appear in the consensus parse if the alternatives are even worse or are incompatible with other high-scoring edges.

Put another way, the parser is not able to include high-scoring features or edges independently of one another. Selecting a good feature means accepting all other features on that edge. It also means rejecting various other edges, because of the global constraints that a legal parse tree must give each word only one parent and must be free of cycles and, in the projective case, crossings.

[2] Note that since we are not trying to predict parts of speech, we treat the output of one or more automatic taggers as yet more inputs to edge feature functions.
Our observation is that this situation is ideal for so-called "bootstrapping," "co-training," or "minimally supervised" learning methods (Yarowsky, 1995; Blum and Mitchell, 1998; Yarowsky and Wicentowski, 2000). Such methods should thrive when the right answer is overdetermined owing to redundant features and/or global constraints.

Concretely, suppose we start by training a supervised parser on only 100 examples, using some regularization method to prevent overfitting to this set. While many features might truly be relevant to the task, only a few appear often enough in this small training set to acquire significantly positive or negative weights.

Even this lightly trained parser may be quite sure of itself on some test sentences in a large unannotated corpus, when one parse scores far higher than all others. More generally, the parser may be sure about part of a sentence: it may be certain that a particular edge is present (or absent), because that edge tends to be present (or absent) in all high-scoring parses.

Retraining the feature weights θ on these high-confidence edges can learn about additional features that are correlated with an edge's success or failure. For example, it may now learn strong weights for lexically specific features that were never observed in the supervised training set. The retrained parser may now be able to confidently parse even more of the unannotated examples; so we can iterate the process. Our hope is that the model identifies new good and bad edges at each step, and does so correctly.
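The iterative retraining just described can be written schematically as follows. Here train, parse, and confidence are hypothetical stand-ins for the learner, the decoder, and a confidence measure, and this particular variant re-labels the unlabeled pool from scratch each round.

```python
def bootstrap(train, parse, confidence, seed, unlabeled, threshold, rounds=5):
    """Generic self-training: repeatedly parse the unlabeled pool, keep the
    high-confidence analyses, and retrain on seed + confident examples."""
    model = train(seed)
    for _ in range(rounds):
        guesses = [(x, parse(model, x)) for x in unlabeled]
        confident = [(x, y) for x, y in guesses
                     if confidence(model, x, y) > threshold]
        if not confident:
            break
        # "Forget" the previous round's self-labels: retrain from scratch
        # on the seed set plus this round's confident examples.
        model = train(seed + confident)
    return model
```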
The more features and global constraints the model has,

- the more power it will have to discriminate among edges even when θ is insufficiently trained. (Some feature weights may be too weak, i.e., too close to zero, because the initial labeled set is small.)
- the more robust it will be against errors even when θ is incorrectly trained. (Some feature weights may be too strong or have the wrong sign, because of overfitting or mistaken parses during bootstrapping.)

In the former case, strong features lend their strength to weak ones. In the latter case, a conflict among strong features weakens the ones that depart from the consensus, or discounts the example sentence if there is no consensus. Previous work on parser bootstrapping has not been able to exploit this redundancy among features, because it has used PCFG-like models with far fewer features (Steedman et al., 2003).
2.3 Adaptation and Projection via Features.

The previous section assumed that we had a small supervised treebank in the target language and domain (plus a large unsupervised corpus). We now consider other, more dubious, knowledge sources that might supplement or replace this small treebank. In each case, we can use these knowledge sources to derive features that may, or may not, prove trustworthy during bootstrapping.

Parses from a different domain. One might have a treebank for a different domain or genre of the target language. One could simply include these trees in the initial supervised training, and hope that bootstrapping corrects any learned weights that are inappropriate to the target domain, as discussed above. In fact, McClosky et al. (2006) found a similar technique to be effective, though only in a model with a large feature space ("PCFG + reranking"), as we would predict.

However, another approach is to train a separate out-of-domain parser, and use this to generate additional features on the supervised and unsupervised in-domain data (Blitzer et al., 2006). Bootstrapping now teaches us where to trust the out-of-domain parser. If our basic model has 100 features, we could add features 101 through 200, where for example f_123(e) = f_23(e) · log Pr'(e) and Pr'(e) is the posterior edge probability according to the out-of-domain parser. Learning that this feature has a high weight means learning to trust the out-of-domain parser's decision on edges where in-domain feature 23 fires.
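A minimal sketch of such derived features follows; the feature indices, the out-of-domain posterior Pr'(e), and the weights are all hypothetical.

```python
import math

def f123(f23_value, out_of_domain_posterior):
    """f_123(e) = f_23(e) * log Pr'(e): fires only where in-domain feature
    23 fires, scaled by the out-of-domain parser's log-posterior for e."""
    return f23_value * math.log(out_of_domain_posterior)

def f201(f_out, theta_out):
    """f_201(e) = sum_{i=1..10} f'_i(e) * theta'_i: the out-of-domain
    parser's own partial score for e, over its first ten features."""
    return sum(f * t for f, t in zip(f_out[:10], theta_out[:10]))
```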
Even more sensibly, we could add features such as f_201(e) = Σ_{i=1}^{10} f'_i(e) · θ'_i, where f' and θ' are the feature and weight vectors for the out-of-domain parser. Learning that this feature has a high weight means learning to trust the out-of-domain parser's feature weights for a particular class of features (those numbered 1 through 10). This addresses the intuition that some linguistic phenomena remain stable across domains.

Parses of translations. Suppose we have translations into English of some of our supervised or unsupervised sentences.
Good probabilistic dependency parsers already exist for English, so we run one over the English translation. We can now derive many additional features on candidate edges in the target sentence. For example, dependency edges in the target language of the form c --poss--> p (this denotes a child-to-parent dependency with label "possessor") might often correspond to dependency paths in the English translation of the form p' --prep--> of --pobj--> c'. To discover whether this is so, we define a feature i by

  f_i(c --poss--> p)  def=  log Σ_{c',p'} [ Pr(c aligns with c') · Pr(p aligns with p') · Pr(p' --prep--> of --pobj--> c') ]    (1)

where c', p' range over word tokens in the English translation, "of" is a literal English word, and the probabilities are posteriors provided by a probabilistic aligner and a probabilistic English parser. Note that this is a single feature (not a feature family parameterized by c, p). It scores any candidate edge on whether it is a --poss--> edge that seems to align to an English --prep--> of --pobj--> path.
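Equation (1) can be sketched directly in code; the alignment posteriors and English path posteriors below are hypothetical inputs that would come from a probabilistic aligner and parser.

```python
import math

def bilingual_feature(align_c, align_p, path_posterior):
    """Equation (1): log of the summed posterior that the target edge
    projects onto an English p' -prep-> "of" -pobj-> c' path.
    align_c[c'] / align_p[p']: posterior that the target child/parent
    aligns to English token c'/p'; path_posterior[(p', c')]: posterior
    that the English parse contains the path from p' to c'."""
    total = sum(pc * pp * path_posterior.get((p, c), 0.0)
                for c, pc in align_c.items()
                for p, pp in align_p.items())
    return math.log(total) if total > 0.0 else float("-inf")
```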
This method is inspired by Hwa et al. (2005), who bootstrapped parsers for Spanish and Chinese by projecting dependencies from English translations and training a new parser on the resulting noisy treebank. They used only 1-best translations, 1-best alignments, dependency paths of length 1, and no labeled data in Spanish or Chinese. Hwa et al. (2005) used a manually written postprocessor to correct some of the many incorrect projections. By contrast, our framework uses the projected dependencies only as one source of features. They may be overridden by other features in particular cases, and will be given a high weight only if they tend to agree with other features during bootstrapping. A similar soft projection of dependencies was used in supervised machine translation by Smith and Eisner (2006), who used a source sentence's dependency paths to bias the generation of its translation.

Note that these bilingual features will only fire on those supervised or unsupervised sentences for which we have an English translation. In particular, they will usually be unavailable on the test set. However, we hope that they will seed and facilitate the bootstrapping process, by helping us confidently parse some unsupervised sentences that we would not be able to confidently parse without an English translation.
Parses of comparable English sentences. World knowledge can be useful in parsing. Suppose you see a French sentence that contains mangeons and pommes, and you know that manger = eat and pomme = apple. You might reasonably guess that pommes is the direct object of mangeons, because you know that apple is a plausible direct object for eat. We can discover this last bit of world knowledge from comparable English text. Translation dictionaries can themselves be induced from comparable corpora (Schafer and Yarowsky, 2002; Schafer, 2006; Klementiev and Roth, 2006), or extracted from bitext or digitized versions of human-readable dictionaries if these are available.
The above inference pattern can be captured by features similar to those in equation (1). For example, one can define a feature j by

  f_j(c --poss--> p)  def=  log Pr( p' --prep--> of --pobj--> c'  |  p' translates p, c' translates c )    (2)

where each event in the event space is a pair (c', p') of same-sentence tokens in comparable English text, all pairs being equally likely. Thus, to estimate Pr(· | ·), the denominator counts same-sentence token pairs (c', p') in the comparable English corpus that translate into the types (c, p), and the numerator counts such pairs that are also related by a --prep--> of --pobj--> path. Since the lexical translations and dependency paths are typically not labeled in the English corpus, a given pair must be counted fractionally according to its posterior probability of satisfying these conditions, given models of contextual translation and English parsing.[3]

[3] Similarly, Jansche (2005) imputes "missing" trees by using comparable corpora.
2.4 Bootstrapping as Optimization.

Section 2.2 assumed a relatively conventional kind of bootstrapping, where each iteration retrains the model on the examples where it is currently most confident. This kind of "confidence thresholding" has been popular in previous bootstrapping work (as cited in §2.2). It attempts to maintain high accuracy while gradually expanding coverage. The assumption is that throughout the training procedure, the parser's confidence is a trustworthy guide to its correctness. Different bootstrapping procedures use different learners, smoothing methods, confidence measures, and procedures for "forgetting" the labelings from previous iterations.

In his analysis of Yarowsky (1995), Abney (2004) formulates several variants of bootstrapping. These are shown to increase either the likelihood of the training data, or a lower bound on that likelihood. In particular, Abney defines a function K that is an upper bound on the negative log-likelihood, and shows his bootstrapping algorithms locally minimize K. We now present a generalization of Abney's K function and relate it to another semi-supervised learning technique, entropy regularization (Brand, 1999; Grandvalet and Bengio, 2005; Jiao et al., 2006). Our experiments will tune the feature weight vector, θ, to minimize our function. We will do so simply by applying a generic function minimization method (stochastic gradient descent), rather than by crafting a new Yarowsky-style or Abney-style iterative procedure for our specific function.

Suppose we have examples x_i and corresponding possible labelings y_{i,k}. We are trying to learn a parametric model p_θ(y_{i,k} | x_i).
If p̃(y_{i,k} | x_i) is a "labeling distribution" that reflects our uncertainty about the true labels, then our expected negative log-likelihood of the model is

  K  def=  − Σ_i Σ_k p̃(y_{i,k} | x_i) log p_θ(y_{i,k} | x_i)
        =  − Σ_i Σ_k p̃(y_{i,k} | x_i) log [ (p_θ(y_{i,k} | x_i) / p̃(y_{i,k} | x_i)) · p̃(y_{i,k} | x_i) ]
        =  Σ_i [ D(p̃_i || p_θ,i) + H(p̃_i) ]    (3)

where p̃_i(·) def= p̃(· | x_i) and p_θ,i(·) def= p_θ(· | x_i). Note that K is a function not only of θ but also of the labeling distribution p̃; a learner might be allowed to manipulate either in order to decrease K.

The summands of K in equation (3) can be divided into two cases, according to whether x_i is labeled or not.
For the labeled examples {x_i : i ∈ L}, the labeling distribution p̃_i is a point distribution that assigns all probability to the true, known label y*_i. Then H(p̃_i) = 0. The total contribution of these examples to K simplifies to −Σ_{i∈L} log p_θ(y*_i | x_i), i.e., just the negative log-likelihood on the labeled data.

But what is the labeling distribution for the unlabeled examples {x_i : i ∉ L}? Abney simply uses a uniform distribution over labels (e.g., parses), to reflect that the label is unknown. If his bootstrapping algorithm "labels" x_i, then i moves into L and H(p̃_i) is thereby reduced from maximal to 0. As a result, a method that labels the most confident examples may reduce K, and Abney shows that his method does so.

Our approach is different: we will take the labeling distribution p̃_i to be our actual current belief p_θ,i, and manipulate it through changing θ rather than L. L remains the original set of supervised examples. The total contribution of the unsupervised examples to K then simplifies to Σ_{i∉L} H(p_θ,i).

We have no reason to believe that these two contributions (supervised and unsupervised) should be weighted equally. We thus introduce a multiplier γ to form the actual objective function that we minimize with respect to θ:[4]

  − Σ_{i∈L} log p_θ,i(y*_i)  +  γ Σ_{i∉L} H(p_θ,i)    (4)

One may regard γ as a Lagrange multiplier that is used to constrain the classifier's uncertainty H to be low, as presented in the work on entropy regularization (Brand, 1999; Grandvalet and Bengio, 2005; Jiao et al., 2006).
Conventional bootstrapping retrains on the most confident unsupervised examples, making them more confident. Gradient descent on equation (4) essentially does the same, since unsupervised examples contribute to (4) only through H, and the shape of the H function means that it is most rapidly decreased by making the most confident unsupervised examples more confident.

Besides favoring models that are self-confident on the unlabeled data, the objective function (4) also explicitly asks the model to continue to get the correct answers on the initial supervised corpus. 1/γ controls the strength of this request. One could obtain a similar effect in conventional bootstrapping by upweighting the initial labeled corpus when retraining.

[4] This function is not necessarily convex in θ, because of the addition of the entropy term (Jiao et al., 2006). One might try an annealing strategy: start γ at zero (where the function is convex) and gradually increase it, hoping to "ride" the global optimum. Although we could increase γ until the entropy term dominates the minimization and we approach a completely deterministic classifier, it is preferable to use some labeled heldout data to evaluate a stopping criterion.
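For a toy classifier with explicitly enumerable labels, objective (4) can be computed as follows; a three-label log-linear model stands in for the parser's distribution over trees, which in practice requires the inside algorithm of §2.6.

```python
import math

def probs(scores):
    """Log-linear distribution over labels given their scores."""
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

def shannon(p):
    return -sum(q * math.log(q) for q in p if q > 0.0)

def objective(labeled, unlabeled, gamma):
    """Equation (4): negative log-likelihood on labeled examples plus
    gamma times the total Shannon entropy on unlabeled examples.
    labeled: list of (scores, true_label_index); unlabeled: list of scores."""
    nll = -sum(math.log(probs(s)[y]) for s, y in labeled)
    ent = sum(shannon(probs(s)) for s in unlabeled)
    return nll + gamma * ent
```

Minimizing this objective pushes the model both to fit the labeled data and to become confident (low-entropy) on the unlabeled data.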
2.5 Online Learning.

Minimizing equation (4) for parsing is more computationally intensive than in many other applications of bootstrapping, such as word sense disambiguation or document classification. With millions of features, our objective could take many iterations to converge to a local optimum, if we were only to update our parameter vector θ after each iteration through a large unsupervised corpus.

For many machine learning problems over large datasets, online learning methods such as stochastic gradient descent (SGD) have been empirically observed to converge in fewer iterations (Bottou, 2003). In SGD, instead of taking an optimization step in the direction of the gradient calculated over all unsupervised training examples, we parse each example, calculate the gradient of the objective function evaluated on that example alone, and then take a small step downhill. The update rule is thus

  θ^(t+1) ← θ^(t) − η ∇F^(t)(θ^(t))    (5)

where θ^(t) is the parameter vector at time t, F^(t)(θ) is the objective function specialized to the time-t example, and η > 0 is a learning rate that we choose. We check for convergence after each pass through the example set.
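The update rule (5) amounts to the familiar SGD loop; the sketch below uses toy one-dimensional per-example objectives F^(t)(θ) = (θ − x)^2 in place of the per-sentence parsing objective.

```python
def sgd(theta, examples, grad_fn, eta=0.1, passes=50):
    """Update rule (5): after each example, step against that example's
    own gradient, theta <- theta - eta * grad F_t(theta)."""
    for _ in range(passes):
        for x in examples:
            g = grad_fn(theta, x)
            theta = [t - eta * gi for t, gi in zip(theta, g)]
    return theta

# Per-example objective (theta - x)^2, so the gradient is 2 * (theta - x);
# with a fixed step size, SGD hovers near the pooled minimizer (the mean).
theta = sgd([0.0], [1.0, 2.0, 3.0], lambda th, x: [2.0 * (th[0] - x)])
```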
2.6 Algorithms and Complexity.

To evaluate equation (4), we need a conditional model of trees given a sentence x_i. We define one by exponentiating and normalizing the tree scores: p_θ,i(y_{i,k}) def= exp( Σ_{e∈y_{i,k}} f(e) · θ ) / Z_i.
With exponentially many parses of x_i, does our objective function (4) now have prohibitive computational complexity? The complexity is actually similar to that of the inside algorithm for parsing. In fact, the first term of (4) for projective parsing is found by running the O(n^3) inside algorithm on supervised data,[5] and its gradient is found by the corresponding O(n^3) outside algorithm. For non-projective parsing, the analogue of the inside algorithm is the O(n^3) "matrix-tree algorithm," which is dominated asymptotically by a matrix determinant (Smith and Smith, 2007; Koo et al., 2007; McDonald and Satta, 2007). The gradient of a determinant may be computed by matrix inversion, so evaluating the gradient again has the same O(n^3) complexity as evaluating the function.

The second term of (4) is the Shannon entropy of the posterior distribution over parses. Computing this for projective parsing takes O(n^3) time, using a dynamic programming algorithm that is closely related to the inside algorithm (Hwa, 2000).[6] For non-projective parsing, unfortunately, the runtime rises to O(n^4), since it requires determinants of n distinct matrices (each incorporating a log factor in a different column; we omit the details).
The gradient evaluation in both cases is again about as expensive as the function evaluation.

A convenient speedup is to replace Shannon entropy with Rényi entropy. The family of Rényi entropy measures is parameterized by α:

  R_α(p) = (1 / (1 − α)) log ( Σ_y p(y)^α )    (6)

In our setting, where p = p_θ,i, the events y are the possible parses y_{i,k} of x_i. Observe that under our definition of p,

  Σ_y p(y)^α = ( Σ_y exp[ Σ_{e∈y} f(e) · (αθ) ] ) / Z_i^α

We already have Z_i from running the inside algorithm, and we can find the numerator by running the inside algorithm again with θ scaled by α. Thus with Rényi entropy, all computations and their gradients are O(n^3), even in the non-projective case.
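The Rényi family (6) and the limits discussed below are easy to check numerically for a small distribution:

```python
import math

def renyi(p, alpha):
    """R_alpha(p) = log(sum_y p(y)^alpha) / (1 - alpha), for alpha >= 0,
    alpha != 1; p is a list of probabilities."""
    return math.log(sum(q ** alpha for q in p)) / (1.0 - alpha)

def shannon(p):
    return -sum(q * math.log(q) for q in p if q > 0.0)
```

Taking alpha near 1 recovers Shannon entropy, a large alpha approaches the negative log-probability of the Viterbi label, and alpha = 0 gives the log of the number of labels.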
Rényi entropy is also a theoretically attractive generalization. It can be shown that lim_{α→1} R_α(p) is in fact the Shannon entropy H(p), and that lim_{α→∞} R_α(p) = −log max_y p(y), i.e., the negative log-probability of the modal or "Viterbi" label (Arndt, 2001; Karakos et al., 2007). The α = 2 case, widely used as a measure of purity in decision-tree learning, is often called the "Gini index." Finally, when α = 0, we get the log of the number of labels, which equals the H(uniform distribution) that Abney used in equation (3).

[5] The numerator of p_θ,i(y*_i) (see definition above) is trivial since y*_i is a single known parse. But the denominator Z_i is a normalizing constant that sums over all parses; it is found by a dependency-parsing variant of the inside algorithm, following Eisner (1996).

[6] See also Mann and McCallum (2007) for similar results on conditional random fields.
For this paper, we performed some initial bootstrapping experiments on small corpora, using the features from McDonald et al. (2005). After discussing experimental setup (§3.1), we look at the correlation of confidence with accuracy and with oracle likelihood, and at the fine-grained behaviour of models' dependency edge posteriors (§3.2). We then compare our confidence-maximizing bootstrapping to EM, which has been widely used in semi-supervised learning (§3.4). Section 3.3 presents overall bootstrapping accuracy.
3.1 Experimental Design.

We bootstrapped non-projective parsers for languages assembled for the CoNLL dependency parsing competitions (Buchholz and Marsi, 2006). We selected German, Spanish, and Czech (Brants et al., 2002; Civit Torruella and Martí Antonín, 2002; Böhmová et al., 2003). After removing sentences more than 60 words long, we randomly divided each corpus into small seed sets of 100 and 1000 trees; development and test sets of 200 trees each; and an unlabeled training set from the rest.

These treebanks contain strict dependency trees, in the sense that their only nodes are the words and a distinguished root node. In the Czech dataset, more than one word can attach to the root; also, the trees in German, Spanish, and Czech may be non-projective.

We use the MSTParser implementation described in McDonald et al. (2005) for feature extraction. Since our seed sets are so small, we extracted features from all edges in both the seed and the unlabeled parts of our training data, not just the edges annotated as correct. Since this produced many more features, we pruned our features to those with at least 10 occurrences over all edges.
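The pruning step can be sketched as follows, with toy feature lists; the threshold of 10 occurrences is the one used in the text.

```python
from collections import Counter

def prune(edge_features, min_count=10):
    """Keep only feature types that fire on at least min_count candidate
    edges, and filter each edge's feature list accordingly."""
    counts = Counter(f for feats in edge_features for f in feats)
    keep = {f for f, c in counts.items() if c >= min_count}
    return [[f for f in feats if f in keep] for feats in edge_features]
```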
                        100-tree model    1000-tree model
  Rényi α               Acc.    Xent.     Acc.    Xent.
  0 (uniform, Abney)   -0.254   0.980    -0.180   0.937
  0.5                  -0.256   0.981    -0.203   0.955
  1 (Shannon)          -0.260   0.983    -0.220   0.964
  2 (Gini)             -0.266   0.985    -0.250   0.977
  5                    -0.291   0.992    -0.304   0.990
  7                    -0.301   0.993    -0.341   0.991
  ∞ (Viterbi)          -0.317   0.995    -0.326   0.992
  Xent.                -0.391   1.000    -0.410   1.000

Table 1: Correlation, on development sentences, of Rényi entropy with model accuracy and with cross-entropy ("Xent."). Since these are measures of uncertainty, we see a negative correlation. As α increases, we place more confidence in high-probability parses and correlate better with accuracy.

We used stochastic gradient descent first to minimize equation (4) on the labeled seed sets. Then we continued to optimize over the labeled and unlabeled data together. We tested for convergence using accuracy on development data.
3.2 Empirically Evaluating Entropy.

Bootstrapping assumes that where the parser is confident, it tends to be correct. Standard bootstrapping methods retrain directly on confident links; similarly, our approach tries to make the parser even more confident on those links. Is this assumption really true empirically? Yes: not only does confidence on unlabeled data correlate with cross-entropy, but both confidence and cross-entropy correlate well with accuracy. As we will see, some confidence measures correlate better than others. In particular, measures that are more peaked around the one-best prediction of the parser, as in Viterbi re-estimation, perform well.
If we train a non-projective German parser on small seed sets of only 100 and 1000 trees, how well does its own confidence predict its performance? For 200 points (labeled development sentences) we measured the linear correlation of various Rényi entropies (6), normalized by sentence length, with tree accuracy (Table 1). We also measured how these normalized Rényi entropies correlate with the posterior log-probability the model assigns to the true parse (the cross-entropy). Since Rényi entropy is a measure of uncertainty, we see a negative correlation with accuracy.
This correlation strengthens as we raise α to ∞, so we might expect Viterbi re-estimation, or a differentiable objective function with a very high α, to perform best on held-out data. Note also that the cross-entropy, which looks at the true labels on the held-out data, does not itself correlate very much better with accuracy than the best unsupervised confidence measures. Finally, we see that Rényi entropies with higher α are more stable: when calculated for a model trained on more data, they improve their correlation with accuracy.

From tree confidence, we now turn to edge confidence: what is the posterior probability that a model assigns to each of the n^2 edges in the dependency graph?

Figure 1: Posterior probability of correct and incorrect edges in German test data under various models. We show the distribution of posterior probabilities for correct edges, known from an oracle, in black, and incorrect edges in gray. In the upper row, learning on an initial supervised set raises the posterior probability of correct edges while dragging along some incorrect edges. In the lower row, we see that adding unlabeled data with R_2 entropy continues the pattern of the supervised learner. R_∞ (Viterbi) training induces a second mode in correct posterior probabilities near 1, although it does shift more incorrect edges closer to 1.

Figure 2: Precision-recall curves for selecting edges according to their posterior probabilities: better bootstrapping puts more area under the curve.
Figure 1 shows smoothed histograms of true edges (black) and false edges (gray) in held-out data, according to the posterior probabilities we assign to them. Since there are many more false edges, the figures are cropped to zoom in on the distribution of true edges. As we start training on the labeled seed set, the posterior probabilities of true edges move towards one; many false edges also get greater mass, but not to the same extent. As we add unlabeled data, we can see the different learning strategies of different confidence measures. R_2 gradually moves a few true and many fewer false edges towards 1, while R_∞ (Viterbi) learning is so confident as to induce a bimodal distribution in the posteriors of true edges.

Figure 2 visualizes the same data as four precision-recall curves, which show how noisy the highest-confidence edges are, across a range of confidence thresholds. Although the very high precision end stays stable after 10 iterations on the seed set, the addition of unlabeled data puts more area under the curve. Again, R_∞ dominates R_2.
3.3 Bootstrapping Results.

We performed bootstrapping experiments on the full CoNLL sets for Czech, German, and Spanish using the non-projective model from McDonald et al. (2005). Performance confirms the results of our analysis above (Table 2). Adding unlabeled data improves performance over that of the seed set, with the exception of the Czech data with R_2 bootstrapping. As we saw in §3.2, bootstrapping with R_∞ dominates bootstrapping with R_2 confidence. For comparison, we also show the results obtained by supervised training on the combined seed and unlabeled sets. Recall that we did not use the tree annotations to perform feature selection; models trained with only supported features ought to perform better.

Although we see statistically significant improvements (at the .05 level on a paired permutation test), the quality of the parsers is still quite poor, in contrast to other applications of bootstrapping which "rival supervised methods" (Yarowsky, 1995). Almost certainly, the CoNLL datasets, comprising at most some tens of thousands of sentences per language, are too small to afford qualitative improvements. Also, at these relatively small training sizes, our preliminary attempts to leverage comparable English corpora did not improve performance.
~~What features were learned, and how dependent is performance on the seed set?~~
We analyzed the performance of German bootstrapping on a development set (Table 3).

                       % accuracy
  Seed trees    γ = 0     2       ∞
  Czech    100    56.1    54.8    58.3
          1000    68.1    68.2    68.2
         71468    77.9    —       —
  German   100    60.9    62.4    65.3
          1000    74.6    74.5    75.0
         37745    86.0    —       —
  Spanish  100    63.6    64.1    64.4
          2786    76.6    —       —

Table 2: Dependency accuracy of the McDonald model on 200 test sentences. When γ = 0, training only occurs on the supervised seed data. As γ increases, we train based on confidence in our model's analysis of the unlabeled data. Boldface results are the best in their rows in a permutation test at the .05 level.
Using the parameters at the last iteration of supervised training on the seed set as a baseline, we tried updating to their bootstrapped values the weights of only those features that occurred in the seed set. This achieved nearly the same accuracy as updating all the features.
As one would expect, using only the non-seed features' weights performs abysmally. This might be the case simply because the seed set is likely to contain frequently occurring features. If, however, we use only the features occurring in an alternate training set of the same size (100 sentences), we get much worse performance.
These results indicate that our bootstrapped parser is still heavily dependent on the features that happened to fire in the seed set; we have not "forgotten" our initial conditions. Similar experiments show that unlexicalized features contribute the most to bootstrapping performance. Since in our log-linear models features have been trained to work together, we must not put too much weight on these ablation results. These experiments do, however, suggest that bootstrapping improved our results by refining the values of known, non-lexicalized features.
~~3.4 Comparison with EM.~~
Perhaps the most popular statistical method for learning from incomplete data is the EM algorithm (Dempster et al., 1977).
Since we cannot try EM on McDonald's conditional model, we ran some pilot experiments using the generative dependency model with valence (DMV) of Klein and Manning (2004). As in their experiments, and unlike the other experiments in the current paper, we restricted ourselves to sentences of ten words or fewer and to part-of-speech sequences alone, without any lexical information.

  Updated       M feat.   acc.     Updated      M feat.   acc.
  all           15.5      64.3     none         0         60.9
  seed          1.4       64.1     non-seed     14.1      44.7
  non-lexical   3.5       64.4     lexical      12.0      59.9
  non-bilex.    12.6      64.4     bilexical    2.9       61.0

Table 3: Using all features, dependency accuracy on German development data rose to 64.3% on bootstrapping. We show the contribution of different feature splits to the performance of this final model. For example, although this model was trained by updating all 15.5M feature weights, it performs as well if we then keep only the 1.4M features that appeared at least once in the seed set, zeroing out the weights of the others. We do as well as the full feature set if we keep only the 3.5M non-lexicalized features.

                             % accuracy
  train                    Bulg.   German   Spanish
  supervised        ML     74.2    80.0     71.3
                    CL     77.5    79.3     75.0
  semi-supervised   EM     58.6    58.8     68.4
                    Conf.  80.0    80.5     76.7

Table 4: Dependency accuracy of the DMV model (Klein and Manning, 2004). Maximizing confidence using R1 (Shannon) entropy improved performance over its own conditional likelihood (CL) baseline and over maximum likelihood (ML). EM degraded its ML baseline. Since these models were only trained and tested on sentences of 10 words or fewer, accuracy is much higher than the full results in Table 2.
Since the DMV models projective trees, we ran experiments on three CoNLL corpora that had augmented their primary non-projective parses with alternate projective annotations: Bulgarian (Simov et al., 2005), German, and Spanish.
We performed supervised maximum likelihood and conditional likelihood estimation on a seed set of 100 sentences for each language. These models respectively initialized EM and confidence training on unlabeled data.
We see (Table 4) that EM degrades the performance of its ML baseline. Merialdo (1994) saw a similar degradation over small (and large) seed sets in HMM POS tagging.
~~We tried fixing and not fixing the feature expectations on the seed set during EM and show the former, better numbers.~~
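One reading of "fixing the feature expectations on the seed set" is that the M step adds the fixed supervised counts from the seed treebank to the expected counts produced by the E step on unlabeled data before renormalizing, so the seed evidence can never be washed out. The sketch below follows that reading; the function and event names are ours, not from the paper:

```python
def m_step(seed_counts, expected_counts, event_space):
    """M step that pins the seed contribution: combine fixed supervised
    counts with expected counts from the E step, then renormalize to a
    probability distribution over the event space."""
    total = {e: seed_counts.get(e, 0.0) + expected_counts.get(e, 0.0)
             for e in event_space}
    z = sum(total.values())
    return {e: c / z for e, c in total.items()}

# Toy example: 10 supervised attachment decisions plus fractional
# expected counts from unlabeled sentences.
seed = {"attach-left": 6, "attach-right": 4}
expected = {"attach-left": 1.2, "attach-right": 8.8}
print(m_step(seed, expected, ["attach-left", "attach-right"]))
```

Without the seed term (ordinary EM), the same expected counts would pull the model much further from its supervised initializer, which is consistent with the degradation reported above.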
Confidence maximization improved over both its own conditional likelihood initializer and also over ML. We selected optimal smoothing parameters for all models, and optimal γ (equation (6)) and α (equation (4)) for the confidence model, on labeled held-out data.
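The training criterion used throughout — a linear combination of conditional log-likelihood on labeled data and confidence (negative Rényi entropy) on unlabeled data — can be sketched as follows; the function signature, the per-sentence posterior representation, and the default α are illustrative rather than the paper's exact formulation:

```python
import math

def objective(loglik_labeled, posteriors_unlabeled, gamma, alpha=2):
    """Bootstrapping objective: supervised conditional log-likelihood
    plus gamma times the total confidence (negative Rényi entropy of
    order alpha) of the model's posteriors on unlabeled sentences.
    gamma = 0 recovers purely supervised training."""
    def neg_renyi(p):
        if alpha == float('inf'):
            return math.log(max(p))  # Viterbi / min-entropy confidence
        return math.log(sum(q ** alpha for q in p)) / (alpha - 1.0)

    confidence = sum(neg_renyi(p) for p in posteriors_unlabeled)
    return loglik_labeled + gamma * confidence

# With gamma = 0 the unlabeled term vanishes (the supervised baseline
# in the gamma = 0 column of Table 2).
print(objective(-12.3, [[0.7, 0.3], [0.5, 0.5]], gamma=0.0))
```

Raising γ trades fit to the seed treebank against peakedness of the model's own analyses of the raw text, and raising α makes the confidence term reward only the most peaked posteriors.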
We hypothesize that qualitatively better bootstrapping results will require much larger unlabeled datasets. In scaling up bootstrapping to larger unlabeled training sets, we must carefully weigh the trade-offs between expanding coverage and introducing noise from out-of-domain data. We could also better exploit the data we have with richer models of syntax.
In supervised dependency parsing, second-order edge features provide improvements (McDonald and Pereira, 2006; Riedel and Clarke, 2006); moreover, the feature-based approach is not limited to dependency parsing.
Similar techniques could score parses in other formalisms, such as CFG or TAG. In this case, f extracts features from each of the derivation tree's rewrite rules (CFG) or elementary trees (TAG). In lexicalized formalisms, f will still be able to score lexical dependencies that are implicitly represented in the parse.
Finally, we want to investigate whether larger training sets will provide traction for sparser cross-lingual and cross-domain features.
Feature-rich dependency models promise to help bootstrapping by providing many redundant features for the learner, and they can also cleanly incorporate cross-domain and cross-language information. We explored bootstrapping feature-rich non-projective dependency parsers for Czech, German, and Spanish. Our bootstrapping method maximizes a linear combination of likelihood and confidence. In initial experiments on small datasets, this surpassed EM for training a simple feature-poor generative model, and also improved the performance of a feature-rich, conditionally estimated model where EM could not easily have been applied. For our models and training sets, more peaked measures of confidence, measured by Rényi entropy, outperformed smoother ones.
Acknowledgments The authors thank the anonymous reviewers, Noah A. Smith, and Keith Hall for helpful comments, and Ryan McDonald for making his parsing code publicly available.
~~This work was supported in part by NSF ITR grant IIS-0313193.~~