Mr. LDA
Abstract
Latent Dirichlet Allocation (LDA) is a popular topic modeling technique for exploring document collections. Because large datasets are increasingly prevalent, inference for LDA needs to scale better. In this paper, we introduce Mr. LDA, a novel and flexible large-scale topic modeling package built on MapReduce. Unlike other techniques, which use Gibbs sampling, our framework uses variational inference, which fits easily into a distributed environment. More importantly, this variational implementation is easily extensible, unlike highly tuned and specialized implementations based on Gibbs sampling. We demonstrate two extensions made possible by this scalable framework: informed priors that guide topic discovery, and topic extraction from a multilingual corpus. We compare the scalability of Mr. LDA against Mahout, an existing large-scale topic modeling package. Mr. LDA outperforms Mahout in both execution speed and held-out likelihood.
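To illustrate why variational inference fits a distributed environment, here is a minimal single-process sketch of one way the computation can split into map and reduce steps. It is not the package's actual code: the function names (`map_document`, `reduce_counts`, `update_topics`, `run`), the dict-of-counts document format, the hyperparameter defaults, and the use of a smoothed point estimate for the topics are all assumptions made for illustration. Each mapper runs the per-document variational updates independently, and the reducer only has to sum the expected topic-word counts the mappers emit.

```python
import math
import random
from collections import defaultdict


def digamma(x):
    """Digamma function via recurrence plus an asymptotic series (x > 0)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))


def map_document(doc, beta, alpha, n_iter=20):
    """Mapper (hypothetical): local variational updates for one document.

    doc maps word id -> count; beta[k][w] is the current topic-word probability.
    Yields ((topic, word), expected count) pairs.
    """
    num_topics = len(beta)
    total = sum(doc.values())
    gamma = [alpha + total / num_topics] * num_topics
    phi = {}
    for _ in range(n_iter):
        exp_psi = [math.exp(digamma(g)) for g in gamma]
        new_gamma = [alpha] * num_topics
        for w, n in doc.items():
            weights = [beta[k][w] * exp_psi[k] for k in range(num_topics)]
            z = sum(weights)
            phi[w] = [x / z for x in weights]
            for k in range(num_topics):
                new_gamma[k] += n * phi[w][k]
        gamma = new_gamma
    for w, n in doc.items():
        for k in range(num_topics):
            yield (k, w), n * phi[w][k]


def reduce_counts(pairs):
    """Reducer (hypothetical): sum expected counts for each (topic, word) key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return totals


def update_topics(totals, num_topics, vocab_size, eta):
    """Driver step: smooth the summed counts with eta and renormalize each topic."""
    beta = []
    for k in range(num_topics):
        row = [eta + totals.get((k, w), 0.0) for w in range(vocab_size)]
        norm = sum(row)
        beta.append([x / norm for x in row])
    return beta


def run(docs, num_topics, vocab_size, alpha=0.1, eta=0.01, n_rounds=10, seed=0):
    """Simulate several map/reduce rounds in one process."""
    rng = random.Random(seed)
    beta = []
    for _ in range(num_topics):
        row = [rng.random() + 0.1 for _ in range(vocab_size)]
        norm = sum(row)
        beta.append([x / norm for x in row])
    for _ in range(n_rounds):
        pairs = [pair for doc in docs for pair in map_document(doc, beta, alpha)]
        beta = update_topics(reduce_counts(pairs), num_topics, vocab_size, eta)
    return beta
```

A call such as `run([{0: 2, 1: 1}, {2: 3, 3: 1}], num_topics=2, vocab_size=4)` returns one probability row per topic. Because the mappers share only the read-only `beta` and communicate solely through summed counts, the per-document work can be spread across machines without coordination inside a round.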