Syntactic N-gram Collection from a Large-Scale Corpus of Internet Finnish
Frontiers in Artificial Intelligence and Applications, 2014, pp. 184–191
Abstract
In this paper, we report on the development of a large-scale Finnish Internet parsebank, currently consisting of 1.5 billion tokens in 116 million sentences. The data is fully analyzed morphologically and syntactically, and it has been used to extract flat and syntactic n-gram collections as well as verb-argument and noun-argument n-grams. Additionally, distributional vector space representations of the words are induced using the word2vec method. All n-gram collections and the vector space models are made available under an open license.
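The difference between the flat and syntactic n-grams mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' extraction pipeline: the toy sentence, its hand-assigned dependency analysis, the label names, and the helper functions `flat_ngrams` and `syntactic_bigrams` are all assumptions made for illustration. Flat n-grams follow the linear word order, whereas syntactic n-grams follow the arcs of the dependency tree.

```python
from typing import List, Tuple

# Each token: (index starting at 1, word form, head index with 0 = root, label).
# Hypothetical, hand-annotated toy sentence ("the small dog ate the bone");
# it is not taken from the parsebank, and the labels are illustrative only.
Token = Tuple[int, str, int, str]

SENTENCE: List[Token] = [
    (1, "pieni", 2, "amod"),
    (2, "koira", 3, "nsubj"),
    (3, "söi", 0, "root"),
    (4, "luun", 3, "dobj"),
]


def flat_ngrams(tokens: List[Token], n: int) -> List[Tuple[str, ...]]:
    """Return all runs of n consecutive word forms in surface order."""
    words = [form for _, form, _, _ in tokens]
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def syntactic_bigrams(tokens: List[Token]) -> List[Tuple[str, str, str]]:
    """Return one (head form, label, dependent form) triple per dependency arc."""
    by_index = {index: form for index, form, _, _ in tokens}
    arcs = []
    for _, form, head, label in tokens:
        if head == 0:
            continue  # the root token has no governing word, so no arc
        arcs.append((by_index[head], label, form))
    return arcs


if __name__ == "__main__":
    print(flat_ngrams(SENTENCE, 2))
    print(syntactic_bigrams(SENTENCE))
```

In this toy case the flat bigrams pair adjacent words only, while the syntactic bigrams link each word to its governor. In longer sentences a head and its dependent can be far apart in the surface string, which is where the two kinds of collection diverge.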