A Method of Accounting Bigrams in Topic Models
Abstract
This paper reports an empirical study of integrating bigram collocations, together with the similarities between bigrams and unigrams, into topic models. First, we propose PLSA-SIM, a novel modification of the original PLSA algorithm that incorporates bigrams and maintains relationships between unigrams and bigrams based on their component structure. We then analyze a variety of word association measures for selecting the top-ranked bigrams to integrate into topic models. All experiments were conducted on four text collections covering different domains and languages. The experiments identify a subgroup of the tested measures whose top-ranked bigrams, when integrated into the PLSA-SIM algorithm, significantly improve topic model quality on all collections.
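As a rough illustration of the bigram selection step, the sketch below ranks adjacent word pairs with pointwise mutual information (PMI) and then links each selected bigram to its component unigrams. PMI is used here only as one common word association measure; the abstract does not say which measures were tested or which performed best. The function names `rank_bigrams_by_pmi` and `component_links`, the `min_count` and `top_k` parameters, and the toy documents are all assumptions made for this example, and the code is not the PLSA-SIM algorithm itself.

```python
import math
from collections import Counter


def rank_bigrams_by_pmi(docs, min_count=2, top_k=10):
    """Rank adjacent word pairs by PMI (hypothetical helper, not from the paper).

    docs is a list of token lists. Bigrams seen fewer than min_count times
    are skipped, since PMI is unreliable for rare pairs.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scored.append(((w1, w2), math.log2(p_xy / (p_x * p_y))))

    # Highest PMI first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]


def component_links(bigrams):
    """Map each unigram to the selected bigrams that contain it.

    This is an assumed, simplified reading of relating unigrams and
    bigrams through their component structure.
    """
    links = {}
    for w1, w2 in bigrams:
        links.setdefault(w1, set()).add((w1, w2))
        links.setdefault(w2, set()).add((w1, w2))
    return links


docs = [
    ["topic", "model", "quality"],
    ["topic", "model", "bigram"],
    ["good", "model", "quality"],
]
top = rank_bigrams_by_pmi(docs, min_count=2)
links = component_links([bigram for bigram, _ in top])
```

In a pipeline of the kind the abstract describes, the top-ranked bigrams would be added to the vocabulary as single tokens, and a mapping like `links` would tell the topic model which unigrams and bigrams share components.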