Paper: Discourse Type Clustering using POS n-gram Profiles and High-Dimensional Embeddings

ACL ID E12-3007
Title Discourse Type Clustering using POS n-gram Profiles and High-Dimensional Embeddings
Venue Annual Meeting of The European Chapter of The Association of Computational Linguistics
Session Student Session
Year 2012
Authors

To cluster textual sequence types (discourse types/modes) in French texts, the K-means algorithm with high-dimensional embeddings and a fuzzy clustering algorithm were applied to clauses whose POS (part-of-speech) n-gram profiles had previously been extracted. Uni-, bi- and trigrams were used on four 19th-century French short stories by Maupassant. For the high-dimensional embeddings, power transformations of the chi-squared distances between clauses were explored. Preliminary results show that high-dimensional embeddings improve the quality of clustering, in contrast to bi- and trigrams, whose performance is disappointing, possibly because of feature space sparsity.
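
As an illustration of the distance step described above, the following is a minimal sketch and not the authors' implementation. It assumes the correspondence-analysis form of the chi-squared distance between relative-frequency profiles, which the abstract does not specify. The function names, the toy tag sequences, and the exponent value 0.5 are all hypothetical choices made for this example.

```python
from collections import Counter
from math import sqrt


def ngram_profile(tags, n=1):
    """Count the POS n-grams of one clause, given as a list of tag strings."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


def chi2_distances(profiles):
    """Pairwise chi-squared distances between clause profiles.

    Assumption: the correspondence-analysis form, in which each squared
    difference of relative frequencies is divided by the feature's overall
    mass (its share of all n-gram occurrences).
    """
    totals = [sum(p.values()) for p in profiles]
    if any(t == 0 for t in totals):
        raise ValueError("every clause needs at least one n-gram")
    features = sorted(set().union(*profiles))
    grand = sum(totals)
    mass = [sum(p[f] for p in profiles) / grand for f in features]
    rel = [[p[f] / t for f in features] for p, t in zip(profiles, totals)]
    return [[sqrt(sum((x - y) ** 2 / m for x, y, m in zip(ri, rj, mass)))
             for rj in rel] for ri in rel]


def power_transform(dist, alpha):
    """Raise every distance to the power alpha (assumed positive)."""
    return [[d ** alpha for d in row] for row in dist]


# Toy clauses with made-up tags; real input would come from a POS tagger.
clauses = [
    ["DET", "NOUN", "VERB"],
    ["DET", "NOUN", "VERB"],
    ["PRON", "VERB", "ADV"],
]
profiles = [ngram_profile(c, n=1) for c in clauses]
D = chi2_distances(profiles)
D_half = power_transform(D, 0.5)  # 0.5 is an arbitrary illustrative exponent
```

Calling `ngram_profile` with `n=2` or `n=3` yields bigram or trigram profiles. With only a few tags per clause, most of those counts are zero, which is one way to picture the feature space sparsity mentioned above. How the transformed distances are turned into an embedding for K-means is not stated in the abstract and is left out of this sketch.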