We address the problem of content-based image retrieval in the context of complex document images. Complex
documents typically start out on paper and are then electronically scanned. These documents have rich internal
structure and might only be available in image form. Additionally, they may have been produced by a combination
of printing technologies (or by handwriting) and may include diagrams, graphics, tables and other non-textual
elements. Large collections of such complex documents are commonly found in legal and security investigations.
The indexing and analysis of large document collections is currently limited to textual features based on OCR data
and ignores the structural context of the document as well as important non-textual elements such as signatures,
logos, stamps, tables, diagrams, and images. Handwritten comments are also normally ignored due to the
inherent complexity of offline handwriting recognition. We address important research issues concerning content-based
document image retrieval and describe a prototype we are developing for integrated retrieval and aggregation of
the diverse information contained in scanned paper documents. Such complex document information
processing combines several forms of image processing together with textual/linguistic processing to enable
effective analysis of complex document collections, a necessity for a wide range of applications. Our prototype
automatically generates rich metadata about a complex document and then applies query tools to integrate
the metadata with text search. To ensure a thorough evaluation of the effectiveness of our prototype, we are
developing a test collection containing millions of document images.
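
The idea of integrating image-derived metadata with text search can be illustrated with a minimal sketch. It is not the prototype's implementation: the `DocumentRecord` type, the `search` function, the element labels, and the sample records are all hypothetical, and simple word matching stands in for whatever text retrieval the actual system uses.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentRecord:
    """One scanned page: OCR text plus metadata detected in the image."""
    doc_id: str
    ocr_text: str
    # Hypothetical metadata labels, e.g. {"signature", "logo", "table"}.
    elements: set[str] = field(default_factory=set)


def search(records, terms, required_elements=()):
    """Return ids of records whose OCR text contains every term and whose
    metadata includes every required non-textual element."""
    wanted_terms = [t.lower() for t in terms]
    wanted_elements = set(required_elements)
    hits = []
    for rec in records:
        words = set(rec.ocr_text.lower().split())
        text_ok = all(t in words for t in wanted_terms)
        metadata_ok = wanted_elements <= rec.elements
        if text_ok and metadata_ok:
            hits.append(rec.doc_id)
    return hits


# Illustrative records only; not data from the test collection.
records = [
    DocumentRecord("d1", "quarterly budget memo", {"signature", "table"}),
    DocumentRecord("d2", "budget request letter", {"logo"}),
    DocumentRecord("d3", "meeting notes", {"signature"}),
]

# Text query "budget" restricted to pages that carry a signature.
signed_budget_docs = search(records, ["budget"], ["signature"])
```

A text-only query for "budget" would match both `d1` and `d2`; adding the signature constraint narrows the result to `d1`, which is the kind of restriction that OCR-only search cannot express.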
Analysis of large collections of complex documents is an increasingly important need for numerous applications. Complex documents are documents that typically start out on paper and are then electronically scanned. These documents have rich internal structure and might only be available in image form. Additionally, they may have been produced by a combination of printing technologies (or by handwriting) and may include diagrams, graphics, tables and other non-textual elements. The state of the art today for a large document collection is essentially text search of OCR'd documents, with no meaningful use of data found in images, signatures, logos, etc. Our prototype automatically generates rich metadata about a complex document and then applies query tools to integrate the metadata with text search. To ensure a thorough evaluation of the effectiveness of our prototype, we are also developing a complex document test collection of roughly 42,000,000 pages. The collection will include relevance judgments for queries at a variety of levels of detail that depend on a variety of content and structural characteristics of documents, as well as "known item" queries looking for particular documents.
Conference Committee Involvement (2)
Document Recognition and Retrieval XI
21 January 2004 | San Jose, California, United States