TopCat: data mining for topic identification in a text corpus
Abstract
TopCat (topic categories) is a technique for identifying topics that recur across articles in a text corpus. Natural language processing techniques identify the key entities in each article, so that an article can be represented as a set of items. This lets us view the problem in a database/data mining context: identifying related groups of items. We present a novel method for identifying related items based on traditional data mining techniques. Frequent itemsets are generated from the groups of items, and the itemsets are then clustered with a hypergraph partitioning scheme. An evaluation against a manually categorized ground-truth news corpus shows that the technique is effective in identifying topics in collections of news articles.
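To make the frequent-itemset step concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function name `frequent_itemsets`, the `min_support` and `max_size` parameters, and the toy entity sets are all assumptions made for illustration. It counts itemsets by brute-force enumeration rather than with an Apriori-style algorithm, and it leaves out the hypergraph partitioning step entirely.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(articles, min_support, max_size=3):
    """Count every itemset of up to max_size items and keep the frequent ones.

    articles    -- iterable of sets of entity names (one set per article)
    min_support -- minimum number of articles an itemset must appear in
    Returns a dict mapping frozenset(itemset) -> number of supporting articles.
    Brute-force enumeration: fine for a toy example, not for a real corpus.
    """
    counts = Counter()
    for items in articles:
        unique = sorted(set(items))
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[frozenset(combo)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


# Hypothetical articles, each already reduced to a set of named entities.
articles = [
    {"A", "B", "C"},
    {"A", "B", "D"},
    {"A", "B", "C"},
    {"X", "Y", "Z"},
    {"X", "Y"},
]

result = frequent_itemsets(articles, min_support=2)
for itemset, support in sorted(result.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), support)
```

In this toy data, the entities A, B, and C co-occur repeatedly, as do X and Y, so their itemsets survive the support threshold while the one-off entities D and Z do not. In the approach the abstract describes, frequent itemsets like these would then be grouped into topics by the hypergraph partitioning scheme.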