Building an Indonesian rule-based part-of-speech tagger
2014, pp. 70–73
Abstract
This paper describes a rule-based part-of-speech tagger for the Indonesian language. The system tokenizes documents, taking multi-word expressions into account, and recognizes named entities. It then assigns tags to every token, proceeding from closed-class words to open-class words, and disambiguates the tags using a set of manually defined rules. The system currently obtains an accuracy of 79% on a manually tagged corpus of roughly 250,000 tokens.
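The pipeline outlined in the abstract (tokenization with multi-word expressions, named-entity recognition, closed-class then open-class tagging, rule-based disambiguation) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the tiny lexicons, the tag names (`PRP`, `MD`, `VB`, `IN`, `CC`, `NN`, `NNP`, `Z`), the capitalization heuristic standing in for named-entity recognition, and the single disambiguation rule are hypothetical and are not taken from the paper.

```python
import re

# Hypothetical toy resources; the paper's real lexicon, tagset and rules are not shown here.
MULTI_WORD = {("rumah", "sakit"): "NN"}                      # "hospital"
CLOSED_CLASS = {"di": {"IN"}, "dan": {"CC"}, "saya": {"PRP"},
                "bisa": {"MD", "NN"}}                        # "can" or "venom"
OPEN_CLASS = {"bekerja": {"VB"}, "makan": {"VB"}, "ular": {"NN"}}


def tokenize(text):
    """Split into words and punctuation, merging known two-word expressions."""
    words = re.findall(r"\w+|[^\w\s]", text)
    tokens, i = [], 0
    while i < len(words):
        pair = tuple(w.lower() for w in words[i:i + 2])
        if len(pair) == 2 and pair in MULTI_WORD:
            tokens.append(" ".join(words[i:i + 2]))
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens


def candidate_tags(token, sentence_initial):
    """Look up candidate tags: multi-word, closed class, open class, then fallbacks."""
    low = token.lower()
    key = tuple(low.split())
    if key in MULTI_WORD:
        return {MULTI_WORD[key]}
    if low in CLOSED_CLASS:
        return set(CLOSED_CLASS[low])
    if low in OPEN_CLASS:
        return set(OPEN_CLASS[low])
    if not token[0].isalnum():
        return {"Z"}                      # punctuation
    if token[0].isupper() and not sentence_initial:
        return {"NNP"}                    # crude stand-in for named-entity recognition
    return {"NN"}                         # assumed default for unknown words


def disambiguate(candidates):
    """Apply one hand-written rule: MD/NN is a modal only when a verb can follow."""
    tags = []
    for i, options in enumerate(candidates):
        if len(options) == 1:
            tags.append(next(iter(options)))
        elif options == {"MD", "NN"}:
            verb_follows = i + 1 < len(candidates) and "VB" in candidates[i + 1]
            tags.append("MD" if verb_follows else "NN")
        else:
            tags.append(sorted(options)[0])   # arbitrary but deterministic fallback
    return tags


def tag(text):
    tokens = tokenize(text)
    candidates = [candidate_tags(t, i == 0) for i, t in enumerate(tokens)]
    return list(zip(tokens, disambiguate(candidates)))


print(tag("Saya bisa bekerja di rumah sakit."))
```

In this sketch, "bisa" is tagged as a modal in "Saya bisa bekerja" because a verb follows, and as a noun in "bisa ular" ("snake venom"). A real system would need a far larger lexicon and rule set than this toy example.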