Paper: Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce

ACL ID D08-1044
Title Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2008
Authors
  • Jimmy Lin (University of Maryland, College Park MD)

This paper explores the challenge of scaling up language processing algorithms to increasingly large datasets. While cluster computing has been available in commercial environments for several years, academic researchers have fallen behind in their ability to work on large datasets. I discuss two barriers contributing to this problem: lack of a suitable programming model for managing concurrency and difficulty in obtaining access to hardware. Hadoop, an open-source implementation of Google’s MapReduce framework, provides a compelling solution to both issues. Its simple programming model hides system-level details from the developer, and its ability to run on commodity hardware puts cluster computing within the reach of many academic research groups. This paper illustrates...