Paper: Crawling microblogging services to gather language-classified URLs. Workflow and case study

ACL ID P13-3002
Title Crawling microblogging services to gather language-classified URLs. Workflow and case study
Venue Annual Meeting of the Association for Computational Linguistics
Session Student Session
Year 2013
Authors

We present a way to extract links from messages published on microblogging platforms and we classify them according to the language and possible relevance of their target in order to build a text corpus. Three platforms are taken into consideration: FriendFeed, identi.ca and Reddit, as they account for a relative diversity of user profiles and more importantly user languages. In order to explore them, we introduce a traversal algorithm based on user pages. As we target lesser-known languages, we try to focus on non-English posts by filtering out English text. Using mature open-source software from the NLP research field, a spell checker (aspell) and a language identification system (langid.py), our case study and our benchmarks give an insight into the linguistic structure of...
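
The link extraction and English filtering described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses neither aspell nor langid.py, and substitutes a tiny hypothetical word list (ENGLISH_WORDS) for a spell-checker dictionary. The function names, the regular expressions and the 0.5 threshold are all assumptions made for illustration.

```python
import re

# Hypothetical, tiny English word list standing in for a real spell-checker
# dictionary such as aspell's; a real run would need a full dictionary.
ENGLISH_WORDS = {
    "the", "a", "is", "this", "of", "and", "to", "in", "for", "on",
    "new", "post", "check", "out", "my", "great", "article", "about",
}

# Assumed patterns: a simple URL matcher and a Unicode-aware word matcher.
URL_RE = re.compile(r"https?://[^\s<>\"']+")
WORD_RE = re.compile(r"[^\W\d_]+")


def extract_links(message):
    """Return the URLs found in a message, with trailing punctuation removed."""
    return [url.rstrip(".,;:!?)") for url in URL_RE.findall(message)]


def english_ratio(message):
    """Share of alphabetic tokens (URLs excluded) found in the word list."""
    text = URL_RE.sub(" ", message)
    tokens = [t.lower() for t in WORD_RE.findall(text)]
    if not tokens:
        return 0.0
    return sum(t in ENGLISH_WORDS for t in tokens) / len(tokens)


def candidate_links(messages, threshold=0.5):
    """Collect links from messages that do not look predominantly English.

    The threshold is an assumed value, not one taken from the paper.
    """
    links = []
    for message in messages:
        if english_ratio(message) < threshold:
            links.extend(extract_links(message))
    return links


messages = [
    "Check out this great article about the new post: http://example.org/en.",
    "Neuer Beitrag über Sprachen http://example.org/de",
]
print(candidate_links(messages))
```

In this sketch the first message is discarded because all of its words appear in the word list, while the link from the second message is kept as a candidate. A later step could then run a language identifier on the target page itself, which is the role the abstract assigns to langid.py.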