The NIST machine-printed database, which was formerly part of the Special Databases collection, contains grayscale and binary images of machine-printed pages. The database was previously known as Special Database 8.
The set contained a total of 3,063,168 characters, an average of 8,509 characters per page.
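The page count is not stated in this description, but it can be estimated from the two figures above. The short sketch below does that arithmetic; the result is an inference that assumes the per-page average was rounded to a whole number, not a figure taken from the database documentation.

```python
# Figures quoted in the description above.
total_characters = 3_063_168
average_per_page = 8509

# Implied page count (an inference, assuming the average was rounded).
implied_pages = round(total_characters / average_per_page)
print(implied_pages)
```

Under that assumption the set works out to roughly 360 pages.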
A reference file was included for each page. Each reference file is the ASCII text page that was used to generate the original hardcopy that was digitized.
The database was distributed for use in the development and testing of Optical Character Recognition (OCR) systems on a common set of images. This allowed vendors to report results against the same image set.
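As an illustration of how a result might be reported against a reference file, the sketch below scores OCR output by character accuracy based on Levenshtein edit distance. This is a common metric chosen here as an assumption; it is not the scoring procedure defined by NIST for this database, and the function names and sample strings are hypothetical.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # reference character missing from the output
                current[j - 1] + 1,      # extra character in the output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Fraction of reference characters recovered, floored at zero."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))


# Hypothetical strings standing in for a reference file and an OCR result.
reference_text = "Optical Character Recognition"
ocr_output = "Optica1 Character Recogniton"
print(f"character accuracy: {character_accuracy(reference_text, ocr_output):.3f}")
```

In practice the reference string would be read from a page's ASCII reference file and the hypothesis from the OCR system under test, so that every system is scored against the same ground truth.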
The database had the following features:
Suitable for automated machine-print research, development, and evaluation, the data set can be used for:
The database was a valuable tool for measurement and comparison of system performance on machine-print pages.
The contact for this database is:
Karen Marshall
National Institute of Standards and Technology
100 Bureau Drive, Stop 8940
Gaithersburg, MD 20899-8940
Phone: (301) 975-8296
karen.marshall [at] nist.gov
Keywords: ASCII Reference, automated character recognition, automated data capture, binary, character recognition, font size, full page, Grayscale Image Database, machine print, NIST, OCR, optical character recognition, software recognition, style.