Paper: On Using Written Language Training Data For Spoken Language Modeling

ACL ID H94-1016
Title On Using Written Language Training Data For Spoken Language Modeling
Venue Human Language Technologies
Session Main Conference
Year 1994
Authors

We attempted to improve recognition accuracy by reducing the inadequacies of the lexicon and language model. Specifically, we address the following three problems: (1) the best size for the lexicon, (2) conditioning written text for spoken language recognition, and (3) using additional training outside the text distribution. We found that increasing the lexicon from 20,000 words to 40,000 words reduced the percentage of words outside the vocabulary from over 2% to just 0.2%, thereby decreasing the error rate substantially. The error rate on words already in the vocabulary did not increase substantially. We modified the language model training text by applying rules to simulate the differences between the training text and what people actually said. Finally, we found that using another thr...
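
The out-of-vocabulary (OOV) percentage mentioned in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that the lexicon is built from the most frequent words in the training text and that the OOV rate is the fraction of test tokens missing from that lexicon; the function names `build_lexicon` and `oov_rate` and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def build_lexicon(training_tokens, size):
    """Return the `size` most frequent word types in the training tokens.

    Assumption: frequency-based selection; the paper's actual lexicon
    construction may differ.
    """
    counts = Counter(training_tokens)
    return {word for word, _ in counts.most_common(size)}


def oov_rate(test_tokens, lexicon):
    """Return the fraction of test tokens that are not in the lexicon."""
    if not test_tokens:
        return 0.0
    missing = sum(1 for token in test_tokens if token not in lexicon)
    return missing / len(test_tokens)


# Toy data (hypothetical), standing in for written training text
# and a transcript of spoken test data.
train = "the cat sat on the mat the dog sat on the log".split()
test = "the cat sat on the rug".split()

small = build_lexicon(train, 3)  # a deliberately small lexicon
large = build_lexicon(train, 7)  # covers every word type in `train`

print(f"OOV rate, small lexicon: {oov_rate(test, small):.3f}")
print(f"OOV rate, large lexicon: {oov_rate(test, large):.3f}")
```

On this toy data the larger lexicon leaves fewer test tokens uncovered, which is the same direction of effect the abstract reports for moving from 20,000 to 40,000 words.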