ACL ID I08-1007
Title Orthographic Disambiguation Incorporating Transliterated Probability
Venue International Joint Conference on Natural Language Processing
Session Main Conference
Year 2008

Orthographic variance is a fundamental problem for many natural language processing applications. The Japanese language, in particular, contains many orthographic variants for two main reasons: (1) transliterated words allow many possible spelling variations, and (2) many characters in Japanese nouns can be omitted or substituted. Previous studies have mainly focused on the former problem; in contrast, this study has addressed both problems using the same framework. First, we automatically collected both positive examples (sets of equivalent term pairs) and negative examples (sets of inequivalent term pairs). Then, by using both sets of examples, a support vector machine based classifier determined whether two terms (t1 and t2) were equivalent. To boost accuracy, we added a transl...