A Comparative Study of Dictionary-based and Machine Learning-based Named Entity Recognition in Pashto
Citations Over Time: Top 22% of 2020 papers
Abstract
Information Extraction (IE) is the process of extracting structured information from unstructured text using natural language processing (NLP). One important sub-task of IE is Named Entity Recognition (NER), the extraction of names of persons, places, and organizations. NER plays an important role in many NLP applications such as Question Answering, Machine Translation, and Text Summarization. It has been widely studied for high-resource languages like English; however, no research has taken place in this regard for Pashto. We hypothesized that an NER system can be developed for Pashto by building on the research done for English and other languages. We have developed two NER systems for detecting names of persons, places, and organizations in Pashto text. The first is a dictionary-based NER that uses three dictionaries containing names of persons, locations, and organizations, respectively. The second is a learning-based approach that uses a Hidden Markov Model (HMM) for the task. We evaluated both systems on a dataset collected from sports news. The evaluation showed an F-measure of 82% for the HMM and 60% for the dictionary-based NER. Our findings highlight that the HMM outperforms the dictionary-based NER.
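To make the dictionary-based approach and the F-measure concrete, the following is a minimal sketch, not the authors' implementation. It assumes one token-level lookup per dictionary and a token-level F-measure. The gazetteer entries, the transliterated example sentence, and the names `GAZETTEERS`, `tag_tokens`, and `f_measure` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy gazetteers (transliterated); the paper's actual
# dictionaries are not reproduced here.
GAZETTEERS: Dict[str, Set[str]] = {
    "PERSON": {"rashid", "nabi"},
    "LOCATION": {"kabul", "kandahar"},
    "ORGANIZATION": {"icc"},
}


def tag_tokens(tokens: List[str]) -> List[str]:
    """Label each token with the first dictionary that contains it, else 'O'."""
    tags = []
    for token in tokens:
        label = "O"
        for entity_type, names in GAZETTEERS.items():
            if token.lower() in names:
                label = entity_type
                break
        tags.append(label)
    return tags


def f_measure(gold: List[str], predicted: List[str]) -> Tuple[float, float, float]:
    """Return token-level (precision, recall, F-measure) over entity labels."""
    true_positives = sum(
        1 for g, p in zip(gold, predicted) if p != "O" and g == p
    )
    predicted_entities = sum(1 for p in predicted if p != "O")
    gold_entities = sum(1 for g in gold if g != "O")
    precision = true_positives / predicted_entities if predicted_entities else 0.0
    recall = true_positives / gold_entities if gold_entities else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Assumed example: "Afghanistan" is missing from the LOCATION dictionary,
# so the lookup misses it and recall drops.
tokens = ["Rashid", "played", "in", "Kabul", "for", "Afghanistan"]
gold = ["PERSON", "O", "O", "LOCATION", "O", "LOCATION"]
predicted = tag_tokens(tokens)
precision, recall, f1 = f_measure(gold, predicted)
```

In this toy case every predicted entity is correct but one gold entity is missed. That is the typical weakness of a pure lookup: a name absent from the dictionaries can never be found, whereas a sequence model such as an HMM can also draw on context.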