Arabic tokenization system
2007, pp. 65–65
Abstract
Tokenization is a necessary and non-trivial step in natural language processing. In the case of Arabic, where a single word can comprise up to four independent tokens, morphological knowledge needs to be incorporated into the tokenizer. In this paper we describe a rule-based tokenizer that handles tokenization as a full-fledged process, with a preprocessing stage (white-space normalizer) and a post-processing stage (token filter). We also show how it handles multiword expressions and how ambiguity is resolved.
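The three-stage pipeline the abstract describes (normalize white space, split clitics, filter tokens) can be sketched as follows. This is an illustrative toy, not the paper's rule-based system: the proclitic and enclitic inventories, the `+` token-boundary notation, and the length heuristics are simplified assumptions, and a naive splitter like this will over-split ambiguous words (e.g. a stem that merely begins with a letter that is also a clitic), which is precisely the ambiguity the paper's rules are designed to resolve.

```python
# Naive sketch of a clitic-splitting Arabic tokenizer with a
# preprocessing stage (white-space normalizer) and a post-processing
# stage (token filter). Clitic lists are simplified assumptions.
import re

PROCLITICS = ["و", "ف", "ب", "ل", "ك"]                     # conjunctions / prepositions
ENCLITICS = ["ها", "هم", "هن", "كم", "نا", "ه", "ك", "ي"]  # pronominal suffixes

def normalize(text: str) -> str:
    """Preprocessing: collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()

def split_word(word: str) -> list[str]:
    """Strip at most one proclitic and one enclitic off a word.

    Length guards keep a minimal stem so very short words are not split.
    """
    tokens = []
    for p in PROCLITICS:
        if word.startswith(p) and len(word) > len(p) + 2:
            tokens.append(p + "+")          # mark as proclitic token
            word = word[len(p):]
            break
    suffix = None
    for e in ENCLITICS:
        if word.endswith(e) and len(word) > len(e) + 2:
            suffix = "+" + e                # mark as enclitic token
            word = word[: -len(e)]
            break
    tokens.append(word)
    if suffix:
        tokens.append(suffix)
    return tokens

def tokenize(text: str) -> list[str]:
    """Full pipeline: normalize, split each word, then filter tokens."""
    out = []
    for w in normalize(text).split(" "):
        out.extend(split_word(w))
    return [t for t in out if t]            # post-processing token filter

# For example, وكتابهم ("and their book") yields three tokens:
# conjunction و+, stem كتاب, and pronoun +هم.
print(tokenize("وكتابهم"))
```

A real system would consult a morphological analyzer before splitting, so that apparent clitics are only detached when the remaining stem is a valid word.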