KEYWORDS: Optical character recognition, Lanthanum, Databases, Visualization, Data modeling, Medicine, Optical inspection, Motion models, Information science, Scientific research
Extraction of metadata from documents is a tedious and expensive process. In general, documents are manually reviewed for structured data such as title, author, date, organization, etc. The purpose of extraction is to build metadata for documents that can be used when formulating structured queries. In many large document repositories such as the National Library of Medicine (NLM)1 or university libraries, the extraction task is a daily process that spans decades. Although some automation is used during the extraction process, metadata extraction is generally a manual task. Aside from the cost and labor time, manual processing is error-prone and requires many levels of quality control. Recent extraction technology, as reported at the Message Understanding Conference (MUC),2 performs comparably to extraction performed by humans. In addition, many organizations use historical data for lookup to improve the quality of extraction. For the large government document repository we are working with, the task involves extraction of several fields from millions of OCR'd and electronic documents. Since this project is time-sensitive, automatic extraction turns out to be the only viable solution. There are more than a dozen fields associated with each document that require extraction. In this paper, we report on the extraction and generation of the title field.
KEYWORDS: Optical character recognition, Information science, Scientific research, Lanthanum, Data storage, Data processing, Internet, Electronic imaging, Databases, Feature extraction
We report on an attempt to build an automatic redaction system by applying information extraction techniques to the identification of private dates of birth. We conclude that automatic redaction is a promising concept although information extraction is significantly affected by the presence of OCR error.
Hundreds of experiments over the last decade on the retrieval of OCR documents performed by the Information Science Research Institute have shown that OCR errors do not significantly affect retrievability. We extend those results to show that in the case of proximity searching, the removal of running headers and footers from OCR text will not improve retrievability for such searches.
For over 10 years, the Information Science Research Institute (ISRI) at UNLV has worked on problems associated with the electronic conversion of archival document collections. Such collections typically have a large fraction of poor quality images and present a special challenge to OCR systems. Frequently, because of the size of the collection, manual correction of the output is not affordable. Because the output text is used only to build the index for an information retrieval (IR) system, the accuracy of non-stopwords is the most important measure of output quality. For these reasons, ISRI has focused on using document level knowledge as the best means of providing automatic correction of non-stopwords in OCR output. In 1998, we developed the MANICURE [1] post-processing system that combined several document level corrections. Because of the high cost of obtaining accurate ground-truth text at the document level, we have never been able to quantify the accuracy improvement achievable using document level knowledge. In this report, we describe an experiment to measure the actual number (and percentage) of non-stopwords corrected by the MANICURE system. We believe this to be the first quantitative measure of OCR conversion improvement that is possible using document level knowledge.
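The abstract above does not say how the corrected non-stopwords were counted. As a rough illustration only, the sketch below assumes a bag-of-words comparison against ground truth; the function name `non_stopword_accuracy`, the tiny stopword list, and the sample strings are all hypothetical and are not taken from MANICURE.

```python
from collections import Counter

# Illustrative stopword list; a real IR system would use a much larger one.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is"}


def non_stopword_accuracy(ground_truth, ocr_text, stopwords=STOPWORDS):
    """Fraction of ground-truth non-stopwords that also occur in the OCR text.

    Words are compared as multisets (bag of words), so position is ignored.
    """
    truth = Counter(w for w in ground_truth.lower().split() if w not in stopwords)
    found = Counter(w for w in ocr_text.lower().split() if w not in stopwords)
    total = sum(truth.values())
    if total == 0:
        return 1.0  # nothing to get wrong
    matched = sum((truth & found).values())  # multiset intersection
    return matched / total


truth = "the storage of nuclear waste"
raw_ocr = "the st0rage of nuclear wa5te"      # hypothetical OCR output
corrected = "the storage of nuclear wa5te"    # hypothetical post-processed output

before = non_stopword_accuracy(truth, raw_ocr)
after = non_stopword_accuracy(truth, corrected)
```

Comparing `before` and `after` gives the kind of improvement figure the experiment is after, under the stated bag-of-words assumption.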
In this paper we describe experiments that investigate the effects of OCR errors on text categorization. In particular, we show that in our environment, OCR errors have no effect on categorization when we use a classifier based on the naive Bayes model. We also observe that dimensionality reduction techniques eliminate a large number of OCR errors and improve categorization results.
KEYWORDS: Optical character recognition, Databases, Prototyping, Legal, Image quality, Diffractive optical elements, Data modeling, Information science, Scientific research
We report on the UNLV-ISRI document collection history, composition, and characteristics. We further provide a short summary of research projects that were conducted using subsets of this collection. These projects were designed to address the retrieval effectiveness from OCR generated collections. Along with this report, ISRI is making this collection available to researchers for further study on the topic of OCR and Information Retrieval.
In this report, we describe the results of an experiment designed to measure the effects of automatic query expansion on retrieval effectiveness. In particular, we used a collection-specific thesaurus to expand the query by adding synonyms of the searched terms. Our preliminary results show no significant gain in average precision and recall.
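Synonym-based query expansion of the kind described above can be sketched as follows. This is a minimal illustration, assuming the thesaurus is a plain mapping from a term to a list of synonyms; the `expand_query` function and the sample entries are hypothetical and do not reproduce the collection-specific thesaurus used in the experiment.

```python
def expand_query(terms, thesaurus):
    """Return the query terms plus their thesaurus synonyms, without duplicates.

    Each original term is kept, followed by its synonyms, in order.
    """
    expanded = []
    seen = set()
    for term in terms:
        key = term.lower()
        for candidate in [key] + thesaurus.get(key, []):
            if candidate not in seen:
                seen.add(candidate)
                expanded.append(candidate)
    return expanded


# Hypothetical thesaurus entries, for illustration only.
THESAURUS = {
    "waste": ["refuse", "spent fuel"],
    "storage": ["repository"],
}

expanded = expand_query(["waste", "storage"], THESAURUS)
```

Terms with no thesaurus entry pass through unchanged, so the expanded query is always a superset of the original one.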
MANICURE is a document processing system that provides integrated facilities for creating electronic forms of printed materials. In this paper the functionalities supported by MANICURE and their implementations are described. In particular, we provide information on specific modules dealing with automatic detection and correction of OCR errors and automatic markup of logical components of the text. We further show that the various text formats produced by MANICURE can be used by web browsers and/or be manipulated by search routines to highlight the requested information on document images.
One predominant application of OCR is the recognition of full text documents for information retrieval. Modern retrieval systems exploit both the textual content of the document as well as its structure. The relationship between textual content and character accuracy has been the focus of recent studies. It has been shown that due to the redundancies in text, average precision and recall are not heavily affected by OCR character errors. What is not fully known is to what extent OCR devices can provide reliable information that can be used to capture the structure of the document. In this paper, we present a preliminary report on the design and evaluation of a system to automatically mark up technical documents, based on information provided by an OCR device. The device we use differs from traditional OCR devices in that it not only performs optical character recognition, but also provides detailed information about page layout, word geometry, and font usage. Our automatic markup program, which we call Autotag, uses this information, combined with dictionary lookup and content analysis, to identify structural components of the text. These include the document title, author information, abstract, sections, section titles, paragraphs, sentences, and de-hyphenated words. A visual examination of the hardcopy is compared to the output of our markup system to determine its correctness.
This paper describes a new expert system for automatically correcting errors made by optical character recognition (OCR) devices. The system, which we call the post-processing system, is designed to improve the quality of text produced by an OCR device in preparation for subsequent retrieval from an information system. The system is composed of numerous parts: an information retrieval system, an English dictionary, a domain-specific dictionary, and a collection of algorithms and heuristics designed to correct as many OCR errors as possible. Errors that cannot be corrected are passed on to a user-level editing program. This post-processing system can be viewed as part of a larger system that would streamline the steps of taking a document from its hard copy form to its usable electronic form, or it can be considered a stand-alone system for OCR error correction. An earlier version of this system has been used to process approximately 10,000 pages of OCR generated text. Among the OCR errors discovered by this version, about 87% were corrected. We implement numerous new parts of the system, test this new version, and present the results.
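The abstract does not specify the correction algorithms or heuristics. As a hedged sketch of the general idea only, the example below looks each token up in a dictionary, tries one approximate match with Python's standard `difflib`, and returns `None` for tokens that would be handed to a manual editing step. The `correct_token` function, the cutoff value, and the word list are assumptions made for illustration.

```python
import difflib

# Hypothetical combined dictionary (general English plus domain-specific terms).
DICTIONARY = ["document", "retrieval", "system"]


def correct_token(token, dictionary, cutoff=0.75):
    """Return the token if known, its closest dictionary match, or None.

    None means no sufficiently similar word was found, so the token
    would be left for manual review.
    """
    if token in dictionary:
        return token
    matches = difflib.get_close_matches(token, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


fixed = correct_token("docurnent", DICTIONARY)   # "rn" misread for "m"
known = correct_token("system", DICTIONARY)
unresolved = correct_token("xqzv", DICTIONARY)
```

String similarity alone will sometimes pick the wrong word, which is one reason a real system would combine it with collection statistics and other heuristics, as the abstract describes.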