Online medical journal article layout analysis
29 January 2007
Jie Zou, Daniel Le, George R. Thoma
Proceedings Volume 6500, Document Recognition and Retrieval XIV; 65000V (2007) https://doi.org/10.1117/12.704434
Event: Electronic Imaging 2007, San Jose, CA, United States
Abstract
We describe a physical and logical layout analysis algorithm that segments and labels online medical journal articles (regular HTML and PDF-Converted-HTML files). For these articles, the geometric layout of the Web page is the most important cue for physical layout analysis. The key step is therefore to render the HTML file in a Web browser, so that the visual information of the zones (each composed of one or more HTML DOM nodes), especially their relative positions, can be used. The recursive X-Y cut algorithm is adopted to construct a hierarchical zone tree. Logical layout analysis uses both geometric and linguistic features: the HTML documents are modeled by a Hidden Markov Model with 16 states, and the Viterbi algorithm then finds the optimal label sequence, which completes the logical layout analysis.
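The recursive X-Y cut step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each zone has already been reduced to a bounding box `(left, top, right, bottom)` taken from the rendered page, and the names `Zone`, `split_by_gaps`, `xy_cut`, and the `min_gap` threshold are hypothetical choices made for illustration. The abstract does not state the actual cut criteria.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed box format: (left, top, right, bottom) in rendered-page pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Zone:
    """A node of the hierarchical zone tree (hypothetical structure)."""
    boxes: List[Box]
    children: List["Zone"] = field(default_factory=list)


def split_by_gaps(boxes: List[Box], axis: int, min_gap: float) -> List[List[Box]]:
    """Group boxes separated by whitespace of at least min_gap along one axis.

    axis=0 projects onto x (finds vertical cuts);
    axis=1 projects onto y (finds horizontal cuts).
    """
    lo, hi = axis, axis + 2
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups = [[ordered[0]]]
    reach = ordered[0][hi]  # farthest extent covered so far on this axis
    for b in ordered[1:]:
        if b[lo] - reach >= min_gap:
            groups.append([b])      # wide enough gap: start a new group
        else:
            groups[-1].append(b)    # overlaps or nearly touches: same group
        reach = max(reach, b[hi])
    return groups


def xy_cut(boxes: List[Box], min_gap: float = 10.0) -> Zone:
    """Recursively split boxes, trying horizontal cuts first, then vertical."""
    zone = Zone(boxes)
    if len(boxes) < 2:
        return zone
    for axis in (1, 0):
        groups = split_by_gaps(boxes, axis, min_gap)
        if len(groups) > 1:
            zone.children = [xy_cut(g, min_gap) for g in groups]
            break
    return zone


# Usage: a full-width title block above two side-by-side columns.
page = [(0, 0, 100, 10), (0, 30, 45, 80), (55, 30, 100, 80)]
root = xy_cut(page, min_gap=5.0)
```

In this example the first horizontal cut separates the title from the body, and a vertical cut then separates the two columns, giving a two-level zone tree. A zone with no sufficiently wide gap in either direction is left as a leaf.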
© (2007) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Jie Zou, Daniel Le, and George R. Thoma "Online medical journal article layout analysis", Proc. SPIE 6500, Document Recognition and Retrieval XIV, 65000V (29 January 2007); https://doi.org/10.1117/12.704434
CITATIONS
Cited by 6 scholarly publications.
KEYWORDS
Visualization
Resistance
Feature extraction
Autoregressive models
Crystals
Image segmentation
Information visualization