A Realistic Dataset for Performance Evaluation of Document Layout Analysis
Citations over time: top 10% of 2009 papers
Abstract
There is a significant need for a realistic dataset on which to evaluate layout analysis methods and examine their performance in detail. This paper presents a new dataset (and the methodology used to create it) based on a wide range of contemporary documents. Strong emphasis is placed on comprehensive and detailed representation of both complex and simple layouts, and on colour originals. In-depth information is recorded both at the page and region level. Ground truth is efficiently created using a new semi-automated tool and stored in a new comprehensive XML representation, the PAGE format. The dataset can be browsed and searched via a Web-based front end to the underlying database and suitable subsets (relevant to specific evaluation goals) can be selected and downloaded.
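To make the idea of ground truth recorded at both the page and region level more concrete, here is a minimal sketch of reading such a file. The abstract does not give the schema, so the element and attribute names used here (`PcGts`, `Page`, `TextRegion`, `ImageRegion`, `Coords`, `points`) are assumptions loosely modelled on published PAGE schemas. The sample fragment is hypothetical: it omits the XML namespace and its values are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified PAGE-like fragment: no namespace, invented values.
SAMPLE = """\
<PcGts>
  <Page imageFilename="example.tif" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1" type="heading">
      <Coords points="100,100 900,100 900,200 100,200"/>
    </TextRegion>
    <ImageRegion id="r2">
      <Coords points="100,300 900,300 900,1200 100,1200"/>
    </ImageRegion>
  </Page>
</PcGts>
"""


def parse_points(points):
    """Turn 'x1,y1 x2,y2 ...' into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def read_regions(xml_text):
    """Return (page-level attributes, list of region-level records)."""
    root = ET.fromstring(xml_text)
    page = root.find("Page")
    regions = []
    for elem in page:
        coords = elem.find("Coords")
        if coords is None:
            continue  # skip children that carry no outline
        regions.append({
            "id": elem.get("id"),
            "kind": elem.tag,          # e.g. TextRegion, ImageRegion
            "type": elem.get("type"),  # None when the attribute is absent
            "polygon": parse_points(coords.get("points")),
        })
    return dict(page.attrib), regions


if __name__ == "__main__":
    page_info, regions = read_regions(SAMPLE)
    print(page_info)
    for region in regions:
        print(region["id"], region["kind"], len(region["polygon"]), "vertices")
```

Real PAGE files declare a versioned XML namespace on the root element, so the lookups above would need namespace-qualified tag names before they could be applied to files downloaded from the dataset.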