Paper: An Empirical Investigation of Statistical Significance in NLP

ACL ID D12-1091
Title An Empirical Investigation of Statistical Significance in NLP
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2012
Authors

We investigate two aspects of the empirical behavior of paired significance tests for NLP systems. First, when one system appears to outperform another, how does significance level relate in practice to the magnitude of the gain, to the size of the test set, to the similarity of the systems, and so on? Is it true that for each task there is a gain which roughly implies significance? We explore these issues across a range of NLP tasks using both large collections of past systems' outputs and variants of single systems. Next, once significance levels are computed, how well does the standard i.i.d. notion of significance hold up in practical settings where future distributions are neither independent nor identically distributed, such as across domains? We explore this question using a r...
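
To make "paired significance test" concrete, here is a minimal sketch of one common choice, a paired bootstrap test. The abstract above does not say which test is used, so this is an illustration under stated assumptions, not the paper's procedure. The function name `paired_bootstrap_p_value`, its parameters, and the use of a mean per-example score are all hypothetical. Metrics that do not decompose over examples (corpus-level F1 or BLEU, for instance) would need the metric recomputed on each resample instead.

```python
import random


def paired_bootstrap_p_value(scores_a, scores_b, num_samples=10000, seed=0):
    """Estimate a one-sided p-value for the gain of system A over system B.

    scores_a and scores_b are per-example scores on the SAME test examples,
    in the same order (this pairing is what makes the test "paired").
    Assumption: the metric is the mean of the per-example scores.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired test needs one score per system per example")
    n = len(scores_a)
    if n == 0:
        raise ValueError("need at least one test example")

    # Per-example differences; their mean is the observed gain of A over B.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed_gain = sum(diffs) / n

    rng = random.Random(seed)
    exceed = 0
    for _ in range(num_samples):
        # Resample test examples with replacement, keeping pairs intact.
        resampled_gain = sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        # Bootstrap gains are centred on the observed gain, not on zero.
        # Shifting by the observed gain approximates the null hypothesis,
        # so we count resamples whose gain is at least twice the observed one.
        if resampled_gain >= 2 * observed_gain:
            exceed += 1
    return exceed / num_samples
```

A small p-value suggests the observed gain is unlikely under the null hypothesis of no real difference. Note that the resampling treats test examples as i.i.d. draws, which is exactly the assumption the second half of the abstract questions.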