Linguistic correlates of style
Abstract
The identification of authorship falls into the category of style classification, an interesting sub-field of text categorization that deals with properties of the form of linguistic expression as opposed to the content of a text. Various feature sets and classification methods have been proposed in the literature, geared towards abstracting away from the content of a text and focusing on its stylistic properties. We demonstrate that in a realistically difficult authorship attribution scenario, deep linguistic analysis features such as context-free production frequencies and semantic relationship frequencies achieve significant error reduction over more commonly used "shallow" features such as function word frequencies and part-of-speech trigrams. Modern machine learning techniques like support vector machines allow us to explore large feature vectors, combining these different feature sets to achieve high classification accuracy in style-based tasks.
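As a rough illustration of the "shallow" features named above, the following minimal sketch computes function word frequencies and part-of-speech trigram frequencies and concatenates them into one feature vector. It is not the authors' implementation: the function word list, the tag names, the trigram vocabulary, and all function names are assumptions made for the example, and the part-of-speech tags are supplied by hand rather than produced by a tagger. The deep features (context-free production and semantic relationship frequencies) require a parser and are left out, as is the support vector machine that would consume the vector.

```python
from collections import Counter

# Hypothetical, tiny function word list; a real study would use a far larger one.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it"]


def function_word_freqs(tokens):
    """Relative frequency of each function word among all tokens."""
    counts = Counter(t.lower() for t in tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def pos_trigram_freqs(tags, vocabulary):
    """Relative frequency of each part-of-speech trigram listed in `vocabulary`."""
    trigrams = list(zip(tags, tags[1:], tags[2:]))
    counts = Counter(trigrams)
    total = len(trigrams) or 1
    return [counts[t] / total for t in vocabulary]


def combined_features(tokens, tags, vocabulary):
    """Concatenate both feature sets into a single vector for a classifier."""
    return function_word_freqs(tokens) + pos_trigram_freqs(tags, vocabulary)


# Toy input: tags are assigned by hand here (assumption: no tagger is run).
tokens = ["The", "cat", "sat", "in", "the", "hat"]
tags = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"]
vocabulary = [("DET", "NOUN", "VERB"), ("ADP", "DET", "NOUN")]

vector = combined_features(tokens, tags, vocabulary)
print(len(vector), vector)
```

Because both feature sets are relative frequencies, they sit on comparable scales and can be concatenated directly. Further feature sets could be appended in the same way before the vector is passed to a classifier.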