Research: Digitization

I have worked extensively in all areas of digitization, including founding and overseeing the highest-volume microform digitization center in academia and consulting widely for academic, governmental, and industrial digitization initiatives. I have extensive experience with OCR, combining open-source and commercial OCR algorithms and workflows into best-of-breed solutions. For example, one such workflow processed more than 100,000 pages per day per server and included automated correction steps that detected poor-quality OCR output, applied adaptive corrective image enhancement, and iteratively retried OCR. I was also technical lead on a component of a project with the United States National Archives and Records Administration (NARA) evaluating future infrastructure to support large-scale digital repositories, with archiving and preservation as key focus areas.
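The detect, enhance, and retry loop described above can be illustrated with a minimal Python sketch. It is not the production workflow: every name here (OcrResult, ocr_with_retry, stretch_contrast, binarize, fake_engine) and the 0.85 confidence threshold are assumptions made for illustration, and a fixed sequence of enhancements stands in for the adaptive enhancement step. The OCR engine is passed in as a plain function so that any open-source or commercial engine could be wrapped behind it.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# A grayscale raster as rows of 0-255 pixel values (illustrative stand-in
# for a real image type).
Image = List[List[int]]


@dataclass
class OcrResult:
    text: str
    confidence: float  # 0.0 (worst) to 1.0 (best), as reported by the engine


OcrEngine = Callable[[Image], OcrResult]
Enhancement = Callable[[Image], Image]


def stretch_contrast(image: Image) -> Image:
    """Linearly rescale pixel values so they span the full 0-255 range."""
    flat = [p for row in image for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return image  # flat image: nothing to stretch
    scale = 255 / (hi - lo)
    return [[round((p - lo) * scale) for p in row] for row in image]


def binarize(image: Image, threshold: int = 128) -> Image:
    """Map every pixel to pure black or pure white."""
    return [[255 if p >= threshold else 0 for p in row] for row in image]


def ocr_with_retry(
    image: Image,
    engine: OcrEngine,
    enhancements: Sequence[Enhancement],
    min_confidence: float = 0.85,  # assumed threshold, tune per collection
) -> Tuple[OcrResult, int]:
    """Run OCR, and while the result looks poor, enhance the image and retry.

    Returns the best result seen and the number of OCR attempts made.
    """
    best = engine(image)
    attempts = 1
    current = image
    for enhance in enhancements:
        if best.confidence >= min_confidence:
            break  # good enough, stop retrying
        current = enhance(current)  # enhancements are applied cumulatively
        result = engine(current)
        attempts += 1
        if result.confidence > best.confidence:
            best = result
    return best, attempts


def fake_engine(image: Image) -> OcrResult:
    """Hypothetical engine for demonstration: confidence tracks contrast."""
    flat = [p for row in image for p in row]
    return OcrResult(text="page", confidence=(max(flat) - min(flat)) / 255)


if __name__ == "__main__":
    low_contrast_page = [[100, 110], [120, 130]]
    result, attempts = ocr_with_retry(
        low_contrast_page, fake_engine, [stretch_contrast, binarize]
    )
    print(result, attempts)
```

Keeping the best result seen, rather than the last one, means an enhancement that makes a page worse cannot degrade the final output. A real system would likely choose the next enhancement based on why the previous attempt failed instead of walking a fixed list.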

Digitization is often portrayed as "simply scanning and searching": a document is fed into a scanner and a searchable PDF emerges on the other side, ready for use. In reality, digitization is a highly specialized pipeline that combines the disciplines of digital imaging, data management, cataloging, user interface design, computation, data mining, and data preservation. I've worked extensively in all of the following areas:

  • Digital Imaging. Nearly all volume digitization initiatives rely on consumer digital cameras as their primary imaging unit. A deep understanding of digital imaging systems, optical assemblies, and discrete light sensing is necessary to develop and tune the most accurate digitization workflows.
  • Image Processing. Extensive image processing occurs on digitized imagery, both in the hardware of the camera or capture system and in later stages. Understanding the algorithms and hardware that underlie these processes makes it possible to control their results.
  • Optical Character Recognition (OCR). Once textual materials have been digitized, they must be converted from their raster imagery into computer-recognizable text.
  • Metadata / Curation. Extensive metadata must be associated with all digital imagery, connecting it to information about provenance and content focus.
  • Access / Data Mining / Search. Access metaphors for digital information must be heavily tailored to the intended user community of a given digital library or repository, matching its unique expectations, access patterns, and workflows.
  • Archival / Preservation. File formats continually evolve, and repositories must be able to continually and automatically upgrade content to the latest formats to ensure continued availability, while archiving previous versions and mitigating "experience loss" from version to version.

In 2008 I authored the first technical comparison of the Google Books and Open Content Alliance book digitization programs, which contradicted many of the popular conceptions of the two services at the time and is widely cited in discussions of the two services and large-scale book digitization.