Paper: Compilation of Specialized Comparable Corpora in French and Japanese

ACL ID W09-3110
Title Compilation of Specialized Comparable Corpora in French and Japanese
Venue Building and Using Comparable Corpora
Session
Year 2009
Authors

In this paper we present the development of a tool for compiling specialized comparable corpora whose quality approaches that of a manually compiled corpus. Comparability is based on three levels: domain, topic and type of discourse. Domain and topic can be filtered with the keywords used in the web search, but detecting the type of discourse requires a broader linguistic analysis. The first step of our work is to automate the detection of the types of discourse found in a scientific domain (science and popular science) in French and Japanese. First, a contrastive stylistic analysis of the two types of discourse is carried out for both languages. This analysis leads to the creation of a reusable, generic and robust typology. Machine learning algorithms are th...
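The approach outlined in the abstract (stylistic features drawn from a typology, then a learned classifier separating scientific from popular-science discourse) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the four features, the function names `features`, `train` and `classify`, the nearest-centroid classifier, and the English toy texts (used only for readability; the paper itself targets French and Japanese). It does not reproduce the paper's typology or its learning algorithms.

```python
import re

# Hypothetical second-person markers, standing in for one stylistic criterion.
SECOND_PERSON = {"you", "your"}


def features(text: str) -> list[float]:
    """Map a document to a small vector of illustrative stylistic features."""
    tokens = re.findall(r"\w+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_tokens = max(len(tokens), 1)
    n_sentences = max(len(sentences), 1)
    return [
        len(tokens) / n_sentences,                        # mean sentence length
        sum(t.isdigit() for t in tokens) / n_tokens,      # share of numeric tokens
        sum(t in SECOND_PERSON for t in tokens) / n_tokens,  # share of "you"/"your"
        text.count("?") / n_sentences,                    # question marks per sentence
    ]


def train(labeled_texts):
    """Compute one centroid (mean feature vector) per discourse label."""
    by_label = {}
    for text, label in labeled_texts:
        by_label.setdefault(label, []).append(features(text))
    return {
        label: [sum(column) / len(column) for column in zip(*vectors)]
        for label, vectors in by_label.items()
    }


def classify(model, text: str) -> str:
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    vector = features(text)

    def distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vector, centroid))

    return min(model, key=lambda label: distance(model[label]))


# Toy training data, invented for this sketch.
TRAINING = [
    ("We measured glucose levels in 120 patients over 12 weeks. "
     "Results indicate a significant reduction in 45 cases.", "scientific"),
    ("Do you worry about sugar? "
     "You can protect your health with simple habits.", "popular"),
]

if __name__ == "__main__":
    model = train(TRAINING)
    print(classify(model, "Are you at risk? Your doctor can help you."))
```

A nearest-centroid rule is used here only because it fits in a few lines without external libraries. The features are left unscaled, so mean sentence length dominates the distance; a realistic setup would normalize them and use language-specific criteria.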