A Universal Part-of-Speech Tagset

arXiv (Cornell University), 2011, pp. 2089–2096

Slav Petrov, Dipanjan Das, Ryan McDonald

Abstract

To facilitate future research in unsupervised induction of syntactic structure and to standardize best practices, we propose a tagset that consists of twelve universal part-of-speech categories. In addition to the tagset, we develop a mapping from 25 different treebank tagsets to this universal set. As a result, when combined with the original treebank data, this universal tagset and mapping produce a dataset consisting of common parts of speech for 22 different languages. We highlight the use of this resource via two experiments, including one that reports competitive accuracies for unsupervised grammar induction without gold-standard part-of-speech tags.
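
The mapping idea in the abstract can be pictured as a lookup table from treebank-specific tags to the twelve universal categories. The Python sketch below is illustrative only: the table covers a handful of Penn Treebank tags rather than a full tagset, the names `PTB_TO_UNIVERSAL` and `to_universal` are hypothetical, and falling back to `X` for unlisted tags is an assumption made here, not something stated in the abstract.

```python
# The twelve universal part-of-speech categories.
UNIVERSAL_TAGS = frozenset({
    "NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
    "ADP", "NUM", "CONJ", "PRT", ".", "X",
})

# Partial, illustrative mapping from Penn Treebank tags to universal tags.
# A real mapping file would cover every tag in the source tagset.
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ",
    "RB": "ADV",
    "PRP": "PRON",
    "DT": "DET",
    "IN": "ADP",
    "CD": "NUM",
    "CC": "CONJ",
    "RP": "PRT",
    ".": ".", ",": ".",
    "FW": "X",
}


def to_universal(tagged_tokens, mapping=PTB_TO_UNIVERSAL, unknown="X"):
    """Convert (word, fine_tag) pairs to (word, universal_tag) pairs.

    Tags missing from the mapping fall back to `unknown`
    (an assumption of this sketch).
    """
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged_tokens]


if __name__ == "__main__":
    sentence = [("The", "DT"), ("dog", "NN"), ("barks", "VBZ"), (".", ".")]
    print(to_universal(sentence))
```

Keeping the mapping as plain data rather than code means one conversion function can serve every treebank: only the table changes from one tagset to the next.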

Related Papers

Europarl: A Parallel Corpus for Statistical Machine Translation (2005)
Universal Dependencies v1: A Multilingual Treebank Collection (2016)
The CoNLL 2007 Shared Task on Dependency Parsing (2007)
Universal Dependency Annotation for Multilingual Parsing (2013)
Token and Type Constraints for Cross-Lingual Part-of-Speech Tagging (2013)