Paper: What's in a p-value in NLP?

ACL ID W14-1601
Title What's in a p-value in NLP?
Venue International Conference on Computational Natural Language Learning
Year 2014

In NLP, we need to document that our proposed methods perform significantly better with respect to standard metrics than previous approaches, typically by reporting p-values obtained by rank- or randomization-based tests. We show that significance results following current research standards are unreliable and, in addition, very sensitive to sample size, covariates such as sentence length, as well as to the existence of multiple metrics. We estimate that under the assumption of perfect metrics and unbiased data, we need a significance cut-off at ~0.0025 to reduce the risk of false positive results to <5%. Since in practice we often have considerable selection bias and poor metrics, this, however, will not do alone.
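
As an illustration of the kind of randomization-based test the abstract refers to, here is a minimal sketch of a paired approximate randomization test in Python. It is not the authors' implementation: the function name, its parameters, and the choice of the summed per-item score difference as the test statistic are assumptions made for this example.

```python
import random


def paired_randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired approximate randomization test (illustrative sketch).

    scores_a, scores_b: per-item scores of two systems on the same test items.
    Returns an estimated p-value for the observed difference in total score.
    All names and defaults here are hypothetical, not taken from the paper.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")

    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))

    hits = 0
    for _ in range(trials):
        # Swapping the two systems' outputs on an item flips the sign
        # of that item's score difference.
        shuffled = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(shuffled) >= observed:
            hits += 1

    # Add-one smoothing keeps the estimate from being exactly zero.
    return (hits + 1) / (trials + 1)
```

Under the abstract's estimate, the returned value would be compared against a cut-off of roughly 0.0025 rather than the conventional 0.05, and the abstract cautions that even this is not sufficient on its own when selection bias or poor metrics are present.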