13 January 2003 Bootstrapping structured page segmentation
Proceedings Volume 5010, Document Recognition and Retrieval X; (2003) https://doi.org/10.1117/12.476058
Event: Electronic Imaging 2003, Santa Clara, CA, United States
Abstract
In this paper, we present an approach to the bootstrap learning of a page segmentation model. The idea evolved from attempts to segment dictionaries, which often have a consistent page structure, and is extended to the segmentation of more general structured documents. In cases of highly regular structure, the layout can be learned from only a few example pages. The system is first trained using a small number of samples, and a larger test set is then processed based on the training result. After corrections are made to a selected subset of the test set, the corrected samples are combined with the original training samples to generate bootstrap samples. The newly created samples are used to retrain the system, refine the learned features, and resegment the test samples. This procedure is applied iteratively until the learned parameters are stable. With this approach, a large set of training samples does not need to be provided initially. We have applied this segmentation to many structured documents, such as dictionaries, phone books, and spoken language transcripts, and obtained satisfactory segmentation performance.
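The train, segment, correct, and retrain loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how the bootstrap samples are generated, so the sketch simply pools the corrected samples with the original training samples. The one-dimensional feature, the threshold "model", and the names `train`, `segment`, `bootstrap_train`, and `correct` are all hypothetical stand-ins chosen for illustration.

```python
import random
from statistics import mean
from typing import Callable, List, Tuple

# Hypothetical toy representation: (feature value, label), e.g. a line's
# left indentation and whether it starts an entry (0) or continues one (1).
Sample = Tuple[float, int]


def train(samples: List[Sample]) -> float:
    """Toy learner (an assumption): threshold halfway between the class means."""
    class0 = [x for x, y in samples if y == 0]
    class1 = [x for x, y in samples if y == 1]
    return (mean(class0) + mean(class1)) / 2.0


def segment(x: float, threshold: float) -> int:
    """Label one feature value using the current learned threshold."""
    return 1 if x > threshold else 0


def bootstrap_train(
    seed_samples: List[Sample],
    test_features: List[float],
    correct: Callable[[float, int], int],
    n_correct: int = 5,
    tol: float = 1e-3,
    max_iter: int = 20,
    seed: int = 0,
) -> float:
    """Iterate train -> segment -> correct -> retrain until the parameter is stable.

    `correct` stands in for the manual correction step: given a feature and
    the predicted label, it returns the true label.
    """
    rng = random.Random(seed)
    pool = list(seed_samples)      # starts as the small initial training set
    threshold = train(pool)
    for _ in range(max_iter):
        # Process the larger test set with the current model.
        predicted = [(x, segment(x, threshold)) for x in test_features]
        # Correct a selected subset and pool it with the training samples.
        chosen = rng.sample(predicted, min(n_correct, len(predicted)))
        pool.extend((x, correct(x, y)) for x, y in chosen)
        # Retrain on the enlarged pool and check parameter stability.
        new_threshold = train(pool)
        if abs(new_threshold - threshold) < tol:
            return new_threshold
        threshold = new_threshold
    return threshold
```

In this sketch the stopping rule compares successive thresholds against `tol`, which is one plausible reading of "until the learned parameters are stable"; a real system would compare a full set of layout parameters and would rely on a human operator rather than a `correct` callback.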
© (2003) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Huanfeng Ma and David Scott Doermann "Bootstrapping structured page segmentation", Proc. SPIE 5010, Document Recognition and Retrieval X, (13 January 2003); https://doi.org/10.1117/12.476058
CITATIONS
Cited by 11 scholarly publications.
KEYWORDS
Associative arrays
Image segmentation
Feature extraction
Optical character recognition
Bismuth
Stochastic processes
Electronic imaging