Enhancement of unsupervised feature selection for conditional random fields learning in Chinese word segmentation
Citations over time: top 23% of 2011 papers
Abstract
This work proposes a unified view of several unsupervised feature selection methods based on frequent strings that improve the conditional random fields (CRF) model for Chinese word segmentation (CWS). The features include character-based n-gram (CNG), accessor variety based string (AVS), term-contributed frequency (TCF), and term-contributed boundary (TCB), with a specific manner of boundary overlapping. In the experiments, the baseline is the 6-tag scheme, a state-of-the-art labeling scheme for CRF-based CWS, and the data sets are taken from the SIGHAN CWS bakeoff 2005 and SIGHAN CWS 2010. The experimental results show that all of these features improve the performance of the baseline system in terms of recall, precision, and their harmonic mean, the F1 measure, on both overall accuracy (F) and out-of-vocabulary recognition (F_OOV). In particular, this work presents a novel feature selection approach, the compound feature “AVS+TCB”, which outperforms the other feature types for CRF-based CWS in terms of F and F_OOV.
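As an illustration of the evaluation measures named in the abstract, the following minimal sketch computes word-level precision, recall, and their harmonic mean (F1) for one segmented sentence. It is not the official SIGHAN scoring script; the function names `to_spans` and `prf1` and the toy sentence are assumptions introduced here for illustration only.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def prf1(gold_words, pred_words):
    """Word-level precision, recall, and F1 for one segmented sentence.

    A predicted word counts as correct only if both of its boundaries
    match a word in the gold segmentation.
    """
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example (hypothetical data, not taken from the SIGHAN corpora).
gold = ["中国", "人民", "很", "友好"]
pred = ["中国", "人", "民", "很", "友好"]
precision, recall, f1 = prf1(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Here three of the five predicted words match gold words exactly, so precision is 3/5 and recall is 3/4. The out-of-vocabulary score F_OOV would apply the same idea restricted to gold words absent from the training vocabulary, which this sketch does not implement.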