Paper: Authorship Attribution and Verification with Many Authors and Limited Data

ACL ID C08-1065
Title Authorship Attribution and Verification with Many Authors and Limited Data
Venue International Conference on Computational Linguistics
Session Main Conference
Year 2008
Authors

Most studies in statistical or machine learning based authorship attribution focus on two or a few authors. This leads to an overestimation of the importance of the features extracted from the training data and found to be discriminating for these small sets of authors. Most studies also use sizes of training data that are unrealistic for situations in which stylometry is applied (e.g., forensics), and thereby overestimate the accuracy of their approach in these situations. A more realistic interpretation of the task is as an authorship verification problem that we approximate by pooling data from many different authors as negative examples. In this paper, we show, on the basis of a new corpus with 145 authors, what the effect is of many authors on feature selection and learning, a...
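
The verification setup described in the abstract (one target author against negative examples pooled from many other authors) can be illustrated with a minimal sketch. The abstract does not give the features, learner, or data format used, so everything here is an assumption made for illustration: the function name `build_verification_set`, the dictionary layout of the corpus, and the toy author names are all hypothetical.

```python
from typing import Dict, List, Tuple


def build_verification_set(
    corpus: Dict[str, List[str]], target: str
) -> List[Tuple[str, int]]:
    """Return (text, label) pairs: 1 for the target author, 0 for all others.

    Hypothetical helper: texts by every non-target author are pooled
    into a single negative class, as the abstract describes.
    """
    if target not in corpus:
        raise KeyError(f"unknown author: {target}")
    examples: List[Tuple[str, int]] = []
    for author, texts in corpus.items():
        label = 1 if author == target else 0
        examples.extend((text, label) for text in texts)
    return examples


# Toy corpus with made-up authors and texts (not data from the paper).
toy_corpus = {
    "author_a": ["text a1", "text a2"],
    "author_b": ["text b1"],
    "author_c": ["text c1", "text c2"],
}

pairs = build_verification_set(toy_corpus, "author_a")
positives = sum(label for _, label in pairs)
negatives = len(pairs) - positives
print(positives, negatives)  # 2 positive texts, 3 pooled negative texts
```

With many authors the negative class grows much larger than the positive class, so any binary learner trained on such pairs has to cope with class imbalance; how the paper handles this is not stated in the abstract.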