PDF Research

Scanning in a Production Imaging Environment

Written and Produced by The Rheinner Group


The Document Capture Process

1. Document Preparation: The first step is document preparation for scanning. There are two basic kinds of preparation:

  • Physical Preparation: locating pages to be scanned, moving them, repairing torn pages, removing staples, rivets and other fastening devices, sorting, labeling, etc.;
  • Batch Preparation: grouping documents by year, type, case, instance, occurrence, or whatever the sorting demands of the application may be.

    Proper preparation can eliminate numerous errors and save valuable time by avoiding rescanning. For example, a daily capture load of 10,000 documents that must be processed within an 8-hour period, so that they are available to the imaging system the next day, requires an effective capture rate of roughly 20 pages per minute (ppm); treating each document as a single page, 10,000 pages over 480 minutes is about 20.8 ppm. Stopping the scanner for a five-minute jam repair, caused by a loose staple or by a stack inadvertently placed wrong side down in the automatic document feeder (ADF), puts the scanning process about 100 documents behind schedule. Spending more time on document preparation can eliminate much of this slowdown. However, it is unreasonable to assume that these problems will never occur, and they should be allowed for when planning subsystem capacity.
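The arithmetic behind this example can be checked with a short Python sketch. It assumes, as the example implicitly does, that each document is a single page; the function names are illustrative only.

```python
def required_rate_ppm(pages_per_day: int, shift_hours: float) -> float:
    """Pages per minute needed to finish the daily load within the shift."""
    if shift_hours <= 0:
        raise ValueError("shift_hours must be positive")
    return pages_per_day / (shift_hours * 60)


def pages_behind(rate_ppm: float, stoppage_minutes: float) -> float:
    """Pages that fall behind schedule while the scanner is stopped."""
    return rate_ppm * stoppage_minutes


if __name__ == "__main__":
    # Assumption: 10,000 single-page documents in one 8-hour shift.
    rate = required_rate_ppm(10_000, 8)
    print(f"required rate: {rate:.1f} ppm")
    # The text rounds the rate to 20 ppm before costing the jam.
    print(f"backlog after a 5-minute jam: {pages_behind(20, 5):.0f} pages")
```

Running the same functions with the expected number of stoppages per shift gives a rough idea of how much spare capacity the capture subsystem needs.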

2. Scanning: Scanner throughput can be estimated by testing the scanner with the types of documents normally scanned. A number of things can be done to speed up scanning. Since scanners work faster when pages are fed in landscape orientation than in the normal lengthwise (portrait) orientation, a common technique is to feed documents in landscape and use the scanning software to batch rotate them back to their normal orientation after they have been scanned. For high-volume systems, a second, identical scanner (a twin backup) should be in operation to avoid the delays caused by jams or problem documents on the main scanner. When it is not covering for jams or paper problems, the twin can perform rescans requested by quality control so that normal production scanning continues uninterrupted.
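The landscape-feed technique can be illustrated with a minimal Python sketch. It represents a page as a small grid of pixel values and assumes that feeding in landscape is equivalent to a 90 degree clockwise turn, so the batch step turns each image 90 degrees counterclockwise. Real scanning software works on image files; the function names here are illustrative only.

```python
from typing import List

Bitmap = List[List[int]]  # rows of pixel values; a stand-in for a real image


def rotate_cw(page: Bitmap) -> Bitmap:
    """Rotate 90 degrees clockwise (models a page fed in landscape)."""
    return [list(row) for row in zip(*page[::-1])]


def rotate_ccw(page: Bitmap) -> Bitmap:
    """Rotate 90 degrees counterclockwise (restores portrait orientation)."""
    return [list(row) for row in zip(*page)][::-1]


def restore_batch(batch: List[Bitmap]) -> List[Bitmap]:
    """Batch rotate every scanned image back to its normal orientation."""
    return [rotate_ccw(page) for page in batch]


if __name__ == "__main__":
    portrait = [[1, 2], [3, 4], [5, 6]]   # 3 rows by 2 columns
    scanned = rotate_cw(portrait)         # 2 rows by 3 columns, as scanned
    print(restore_batch([scanned])[0] == portrait)
```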

2a. Quality Control: Quality control is the inspection of each scanned image to make sure that the image controls set for the scanner have produced an image of acceptable quality. The quality control operator can make some adjustments to the image, for example deskewing slightly skewed documents or rotating upside-down documents. However, quality control operators cannot correct a stretched image, low contrast, or incorrect resolution. These documents must be rejected and rescanned with adjusted scanner controls.
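The quality control decision described above can be sketched as a simple triage rule in Python. The defect labels are hypothetical names chosen for this example; an actual system would define its own.

```python
from typing import Iterable

# Defects the operator can fix at the quality control station.
CORRECTABLE = {"skew", "upside_down"}
# Defects that can only be fixed by rescanning with adjusted controls.
RESCAN_REQUIRED = {"stretched", "low_contrast", "wrong_resolution"}


def triage(defects: Iterable[str]) -> str:
    """Return 'accept', 'correct' or 'rescan' for one inspected image."""
    found = set(defects)
    unknown = found - CORRECTABLE - RESCAN_REQUIRED
    if unknown:
        raise ValueError(f"unknown defect labels: {sorted(unknown)}")
    if found & RESCAN_REQUIRED:
        return "rescan"   # any uncorrectable defect forces a rescan
    if found:
        return "correct"  # only operator-fixable defects remain
    return "accept"


if __name__ == "__main__":
    print(triage([]))
    print(triage(["skew"]))
    print(triage(["skew", "low_contrast"]))
```

Images routed to "rescan" are the ones a twin scanner can pick up without interrupting production scanning.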

3. Indexing: Indexing, the most critical and most time-consuming step in the capture process, identifies the document and its contents to the image management system. An index is akin to the card index of books found in a library and serves a similar purpose. Someone looking for a book first goes to the index to look up the book by author, publisher, date of publication or subject. The index card then reveals where the book is kept on the library shelves. The process for indexing images is similar, since the images themselves are kept as files on very high capacity storage devices and therefore cannot easily be found without an index. A document imaging system typically uses two kinds of index:

  • A system-level index is an index automatically produced during the capture process. It may include information about the type of document captured, the date and time it was captured and other preset information. Depending on how the scanning process was organized, the batch might be identified as forms from New England, Depositions for case # xx, mortgages from 9/11/94 - 9/17/95, etc.
  • An operational index is the index information required for each document. This would include information that can be read from the image itself, including the customer name, account number, social security number, subject, etc.
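The two kinds of index can be pictured as two small records attached to each image, plus a lookup that plays the role of the library index card. The Python sketch below is one possible layout, not a description of any particular product; the field names and the `image_path` attribute are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class SystemIndex:
    """Filled in automatically during capture."""
    document_type: str
    captured_at: datetime
    batch_label: str          # e.g. "forms from New England"


@dataclass(frozen=True)
class OperationalIndex:
    """Read from the image itself, by an operator or by recognition."""
    customer_name: str
    account_number: str


@dataclass(frozen=True)
class IndexEntry:
    system: SystemIndex
    operational: OperationalIndex
    image_path: str           # hypothetical location of the stored image


def build_lookup(entries: Iterable[IndexEntry]) -> Dict[str, List[str]]:
    """Map each account number to the images filed under it."""
    lookup: Dict[str, List[str]] = {}
    for entry in entries:
        key = entry.operational.account_number
        lookup.setdefault(key, []).append(entry.image_path)
    return lookup
```

A query by account number then returns the storage locations directly, just as the index card points to a shelf.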

    A great deal of thought must be given to the index. As a rule of thumb, it should not be excessively long but must be accurate enough to identify the document clearly. In the modern world of databases, the tendency is to overindulge in fields, keywords and complicated queries. These are temptations that should be avoided at all costs in document imaging. Lengthy indexes take a long time to complete, which is of considerable significance to the document capture process in a production environment. In addition, the greater the number of index fields, the greater the possibility of errors during data entry.
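The effect of index length on accuracy can be made concrete with a small Python sketch. It assumes that every field is keyed with the same error rate and that errors in different fields are independent, which is a simplification; the function names and the sample rate are illustrative, not measured values.

```python
def record_error_probability(fields: int, per_field_error_rate: float) -> float:
    """Chance that at least one field of an index record is keyed wrongly.

    Assumes independent errors with the same rate for every field.
    """
    if fields < 0 or not 0.0 <= per_field_error_rate <= 1.0:
        raise ValueError("fields must be >= 0 and the rate within [0, 1]")
    return 1.0 - (1.0 - per_field_error_rate) ** fields


def keying_seconds(fields: int, seconds_per_field: float) -> float:
    """Operator time to complete one index record."""
    return fields * seconds_per_field


if __name__ == "__main__":
    # Hypothetical 1% error rate per field.
    for n in (3, 6, 12):
        p = record_error_probability(n, 0.01)
        print(f"{n:2d} fields: {p:.1%} of records contain an error")
```

Under these assumptions both the keying time and the share of faulty records grow with every field added, which is the practical argument for a short index.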

3a. Data Extraction: Data can be extracted from images either automatically or manually. The manual process involves a human operator who reads the image and fills out an attached set of database fields. This is how most indexes are created. Other information that is not needed for the index may also have to be extracted at this time. Information such as amount paid, customer address, or bank account numbers may be extracted from the image and placed into a database other than the index database.

Depending on the type of document, it may make the most sense to extract the information automatically. Forms and other highly structured documents are well suited to automatic recognition by an optical character recognition (OCR) engine, leaving the operator free to process only those images the engine cannot read. The OCR engine may also be instructed to look in certain areas of the form in order to find, read and transfer the information into a corresponding index field.
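The zone-based approach can be sketched in Python as follows. The sketch does not call a real OCR engine: the page is a grid of characters standing in for an image, and `toy_recognize` is a hypothetical stand-in that returns text with a confidence value. The zone coordinates, field names and the confidence threshold are likewise assumptions for illustration.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple

Region = List[str]  # rows of characters; a stand-in for a cropped image


class Zone(NamedTuple):
    field: str   # index field that receives the recognized text
    left: int
    top: int
    right: int
    bottom: int


def crop(page: Region, zone: Zone) -> Region:
    """Cut the zone's rectangle out of the page."""
    return [row[zone.left:zone.right] for row in page[zone.top:zone.bottom]]


def extract_fields(
    page: Region,
    zones: List[Zone],
    recognize: Callable[[Region], Tuple[str, float]],
    min_confidence: float = 0.9,
) -> Tuple[Dict[str, str], List[str]]:
    """Fill index fields automatically; list the fields left for the operator."""
    accepted: Dict[str, str] = {}
    needs_operator: List[str] = []
    for zone in zones:
        text, confidence = recognize(crop(page, zone))
        if confidence >= min_confidence:
            accepted[zone.field] = text
        else:
            needs_operator.append(zone.field)
    return accepted, needs_operator


def toy_recognize(region: Region) -> Tuple[str, float]:
    """Hypothetical recognizer: '?' marks a character it could not read."""
    text = " ".join(row.strip() for row in region).strip()
    confidence = 0.0 if (not text or "?" in text) else 1.0
    return text, confidence
```

With a page such as `["NAME: J SMITH   ", "ACCT: 12?4      "]` and one zone per line, the name is filled in automatically while the account number is routed to the operator.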

A number of third-party products make form recognition a highly automated process: they can focus on zones within the document, read barcodes on the document, recognize mark sense or magnetic ink, and then automatically process the form accordingly. Form processing software is a rapidly advancing field and offers many advantages for production imaging systems. Unfortunately, given the breadth of the subject, we cannot do it justice in this guide.

Designing forms with the imaging system and OCR engine in mind yields the greatest gains in performance. It allows, for example, the use of color filters in the scanner to eliminate unnecessary elements of the form, so that the OCR engine is presented with the cleanest image possible.

4. Information Release: Information release is the last step of the capture process. Once the document has been prepared, delivered to the scanner, scanned, passed through quality control, indexed, and had its data extracted, it is ready to be released to the image system. Depending on the application, there may be no further use for the image once the data has been extracted, in which case it will simply be deleted. The more likely scenario is that the image will need to be available to others in the business process or will need to be archived at this point. In either case it can now be made available to the rest of the system.

The net effective throughput of the image system can be determined from the length of time required for each of the above steps. Considerable planning, technical assistance and attention to performance are required to reduce each step to a minimum commitment of time and resources.
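One way to turn per-step times into a throughput figure is sketched below in Python. It covers two simplified cases: a single operator who performs every step in turn, and a line of dedicated stations working in parallel, where the slowest step sets the pace. The step times are hypothetical placeholders, not measurements.

```python
from typing import Dict

# Hypothetical average seconds per document for each capture step.
STEP_SECONDS: Dict[str, float] = {
    "preparation": 6.0,
    "scanning": 3.0,
    "quality control": 4.0,
    "indexing": 20.0,
    "release": 1.0,
}


def _check(step_seconds: Dict[str, float]) -> None:
    if not step_seconds or any(t <= 0 for t in step_seconds.values()):
        raise ValueError("need at least one step, each with a positive time")


def docs_per_hour_sequential(step_seconds: Dict[str, float]) -> float:
    """One operator does every step in turn: step times add up."""
    _check(step_seconds)
    return 3600.0 / sum(step_seconds.values())


def docs_per_hour_pipelined(step_seconds: Dict[str, float]) -> float:
    """One station per step, all busy at once: the slowest step governs."""
    _check(step_seconds)
    return 3600.0 / max(step_seconds.values())


def bottleneck(step_seconds: Dict[str, float]) -> str:
    """Name of the step that takes longest per document."""
    _check(step_seconds)
    return max(step_seconds, key=lambda name: step_seconds[name])


if __name__ == "__main__":
    print(f"sequential: {docs_per_hour_sequential(STEP_SECONDS):.0f} docs/hour")
    print(f"pipelined:  {docs_per_hour_pipelined(STEP_SECONDS):.0f} docs/hour")
    print(f"bottleneck: {bottleneck(STEP_SECONDS)}")
```

With these placeholder times indexing is the bottleneck, which matches the earlier observation that indexing is the most time-consuming step.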


a production of Performance Graphics
©1998 The Miller De Wulf Corporation