
TopCat: data mining for topic identification in a text corpus

IEEE Transactions on Knowledge and Data Engineering, 2004, Vol. 16(8), pp. 949–964

Top 1% of 2004 papers

Chris Clifton, Robert Cooley, Jason D. M. Rennie

Abstract

TopCat (topic categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This lets us view the problem in a database/data mining context: identifying related groups of items. We present a novel method for identifying related items based on traditional data mining techniques. Frequent itemsets are generated from the groups of items, and clusters are then formed from them with a hypergraph partitioning scheme. We present an evaluation against a manually categorized ground truth news corpus; it shows that this technique is effective in identifying topics in collections of news articles.
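
The pipeline described in the abstract (articles as sets of entities, then frequent itemsets, then clustering) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the brute-force counting, and the toy entity sets are all assumptions made for illustration, and the hypergraph partitioning step is replaced here by a much simpler stand-in that merges itemsets sharing an entity.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(articles, min_support, max_size=3):
    """Return entity sets (up to max_size items) found in >= min_support articles.

    Brute-force counting, suitable only for toy inputs; a real system
    would use an Apriori-style or similar frequent itemset miner.
    """
    counts = Counter()
    for entities in articles:
        items = sorted(set(entities))
        for k in range(1, max_size + 1):
            counts.update(combinations(items, k))
    return {frozenset(s): c for s, c in counts.items() if c >= min_support}


def merge_overlapping(itemsets):
    """Simplified stand-in for hypergraph partitioning (an assumption):
    union any itemsets that share at least one entity."""
    clusters = []
    for itemset in itemsets:
        merged = set(itemset)
        for cluster in [c for c in clusters if c & merged]:
            merged |= cluster
            clusters.remove(cluster)
        clusters.append(merged)
    return clusters


# Hypothetical toy corpus: each article is already reduced to its key entities.
articles = [
    {"NATO", "Kosovo", "Belgrade"},
    {"NATO", "Kosovo", "Milosevic"},
    {"NATO", "Kosovo"},
    {"Microsoft", "Justice Department"},
    {"Microsoft", "Justice Department", "Gates"},
    {"NATO", "Microsoft"},
]

freq = frequent_itemsets(articles, min_support=2)
# Keep only multi-entity itemsets as topic candidates, then group them.
topics = merge_overlapping([s for s in freq if len(s) >= 2])
print(topics)
```

On this toy input the two candidate topics stay separate because no frequent multi-entity itemset links them. Merging on any shared entity would fuse unrelated topics in a real corpus, which is the kind of problem a partitioning scheme is meant to handle.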

Related Papers

→ Hypergraph Modeling (2023), 1 cited
→ Decompositions of 3-uniform hypergraph K_v^{(3)} into hypergraph K_4^{(3)}+e (2010)
→ On the Random Greedy $F$-Free Hypergraph Process (2015)
→ Non-uniform Hypergraphs (2020)