Paper: Unsupervised Tokenization for Machine Translation

ACL ID D09-1075
Title Unsupervised Tokenization for Machine Translation
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2009

Training a statistical machine translation system starts with tokenizing a parallel corpus. Some languages such as Chinese do not incorporate spacing in their writing system, which creates a challenge for tokenization. Moreover, morphologically rich languages such as Korean present an even bigger challenge, since optimal token boundaries for machine translation in these languages are often unclear. Both rule-based solutions and statistical solutions are currently used. In this paper, we present unsupervised methods to solve the tokenization problem. Our methods incorporate information available from a parallel corpus to determine a good tokenization for machine translation.