Paper: Alignment-Based Discriminative String Similarity

ACL ID P07-1083
Venue Annual Meeting of the Association of Computational Linguistics
Session Main Conference
Year 2007

A character-based measure of similarity is an important component of many natural language processing systems, including approaches to transliteration, coreference, word alignment, spelling correction, and the identification of cognates in related vocabularies. We propose an alignment-based discriminative framework for string similarity. We gather features from substring pairs consistent with a character-based alignment of the two strings. This approach achieves exceptional performance; on nine separate cognate identification experiments using six language pairs, we more than double the precision of traditional orthographic measures like Longest Common Subsequence Ratio and Dice’s Coefficient. We also show strong improvements over other recent discriminative and heuristic simila...
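For readers unfamiliar with the two orthographic baselines named in the abstract, the following is a minimal sketch of how they are commonly defined. It assumes the usual textbook formulations: Longest Common Subsequence Ratio as the LCS length divided by the length of the longer string, and Dice's Coefficient computed over character bigrams. The function names and the example word pair are illustrative and are not taken from the paper, whose exact variants may differ.

```python
from collections import Counter


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """LCS length divided by the length of the longer string (assumed definition)."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 0.0


def bigrams(s: str) -> Counter:
    """Multiset of adjacent character pairs in s."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice(a: str, b: str) -> float:
    """Twice the shared bigram count over the total bigram count (assumed definition)."""
    ba, bb = bigrams(a), bigrams(b)
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:
        return 0.0
    shared = sum((ba & bb).values())  # multiset intersection
    return 2 * shared / total


if __name__ == "__main__":
    # Hypothetical English-French pair, chosen only for illustration.
    print(lcsr("colour", "couleur"))
    print(dice("colour", "couleur"))
```

Both measures score a word pair from surface character overlap alone, with no parameters learned from data, which is the contrast the abstract draws with the proposed discriminative framework.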