Paper: Entity Clustering Across Languages

ACL ID N12-1007
Title Entity Clustering Across Languages
Venue Annual Conference of the North American Chapter of the Association for Computational Linguistics
Session Main Conference
Year 2012

Standard entity clustering systems commonly rely on mention (string) matching, syntactic features, and linguistic resources like English WordNet. When co-referent text mentions ap- pear in different languages, these techniques cannot be easily applied. Consequently, we develop new methods for clustering text men- tions across documents and languages simulta- neously, producing cross-lingual entity clusters. Our approach extends standard clustering algo- rithms with cross-lingual mention and context similarity measures. Crucially, we do not as- sume a pre-existing entity list (knowledge base), so entity characteristics are unknown. On an Arabic-English corpus that contains seven dif- ferent text genres, our best model yields a 24.3% F1 gain over the baseline.