Paper: Towards Spoken-Document Retrieval For The Internet: Lattice Indexing For Large-Scale Web-Search Architectures

ACL ID N06-1053
Title Towards Spoken-Document Retrieval For The Internet: Lattice Indexing For Large-Scale Web-Search Architectures
Venue Human Language Technologies
Session Main Conference
Year 2006
Authors

Large-scale web-search engines are generally designed for linear text. The linear text representation is suboptimal for audio search, where accuracy can be significantly improved if the search includes alternate recognition candidates, commonly represented as word lattices. This paper proposes a method for indexing word lattices that is suitable for large-scale web-search engines, requiring only limited code changes. The proposed method, called Time-based Merging for Indexing (TMI), first converts the word lattice to a posterior-probability representation and then merges word hypotheses with similar time boundaries to reduce the index size. Four alternative approximations are presented, which differ in index size and the strictness of the phrase-matching constraints. Results are ...
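
To make the idea of merging word hypotheses with similar time boundaries concrete, the following is a minimal sketch and not the paper's TMI algorithm, whose details are not given in this abstract. It assumes the lattice has already been reduced to a flat list of word hypotheses with posterior probabilities. The names Hyp, merge_boundaries, merge_hypotheses, and the tolerance parameter tol are hypothetical, chosen only for illustration. Boundaries that fall within tol seconds of a slot's first boundary share one slot, and hypotheses with the same word and the same start and end slots are collapsed by summing their posteriors.

```python
from collections import defaultdict
from typing import NamedTuple


class Hyp(NamedTuple):
    """One word hypothesis (hypothetical structure, not from the paper)."""
    word: str
    start: float      # start time in seconds
    end: float        # end time in seconds
    posterior: float  # posterior probability of this hypothesis


def merge_boundaries(times, tol):
    """Greedy 1-D clustering of time points into integer slots.

    A new slot is opened whenever a time point lies more than `tol`
    seconds after the first time point of the current slot.
    """
    slots = {}
    slot_id = -1
    anchor = None
    for t in sorted(set(times)):
        if anchor is None or t - anchor > tol:
            slot_id += 1
            anchor = t
        slots[t] = slot_id
    return slots


def merge_hypotheses(hyps, tol=0.05):
    """Collapse hypotheses that share a word and merged time slots.

    Returns a dict mapping (word, start_slot, end_slot) to the summed
    posterior. Note: with a large `tol`, a short word may start and end
    in the same slot; this sketch does not treat that case specially.
    """
    times = [h.start for h in hyps] + [h.end for h in hyps]
    slot = merge_boundaries(times, tol)
    merged = defaultdict(float)
    for h in hyps:
        merged[(h.word, slot[h.start], slot[h.end])] += h.posterior
    return dict(merged)


if __name__ == "__main__":
    example = [
        Hyp("web", 0.00, 0.30, 0.5),
        Hyp("web", 0.02, 0.31, 0.25),  # near-duplicate in time of the first
        Hyp("wet", 0.00, 0.30, 0.25),
        Hyp("search", 0.30, 0.80, 1.0),
    ]
    for key, p in sorted(merge_hypotheses(example).items()):
        print(key, p)
```

In this toy input, the two "web" hypotheses differ only by a few hundredths of a second, so they end up as a single index entry, which is the kind of size reduction the abstract attributes to time-based merging. Integer slots also resemble the word positions that a text-oriented index typically stores, although how TMI actually maps lattices to positions is not stated here.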