Research: Digitization

I have worked extensively in all areas of digitization, including founding and overseeing the highest-volume microform digitization center in academia and consulting widely for academic, governmental, and industrial digitization initiatives. I have extensive experience with OCR, combining both open source and commercial OCR algorithms and workflows into best-of-breed solutions. For example, one such workflow processed more than 100,000 pages per day per server and included automated correction steps that detected poor-quality OCR output, applied adaptive corrective image enhancement, and iteratively retried OCR. I was also technical lead on a component of a project with the United States National Archives and Records Administration (NARA) evaluating future infrastructure to support large-scale digital repositories, with archiving and preservation as key focus areas.
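
To make the detect, enhance, and retry idea concrete, here is a minimal Python sketch of such a loop. It is an illustration only, not the production workflow described above: the `OcrResult` type, the `ocr_with_retry` function, the confidence threshold, and the use of a fixed list of enhancement steps are all assumptions made for the example. The OCR engine and the image enhancements are passed in as plain callables, so no particular OCR library is implied.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence, Tuple


@dataclass
class OcrResult:
    """Hypothetical container for one OCR pass."""
    text: str
    confidence: float  # assumed to be a 0.0-1.0 quality score from the engine


def ocr_with_retry(
    image: Any,
    run_ocr: Callable[[Any], OcrResult],
    enhancements: Sequence[Callable[[Any], Any]],
    min_confidence: float = 0.85,  # assumed threshold, tune per collection
) -> Tuple[OcrResult, int]:
    """Run OCR, and while quality is poor, enhance the image and retry.

    Enhancements are applied cumulatively, in order. The best-scoring
    result seen so far is kept, and the number of OCR passes is returned
    alongside it.
    """
    best = run_ocr(image)
    attempts = 1
    current = image
    for enhance in enhancements:
        if best.confidence >= min_confidence:
            break  # quality is acceptable, stop retrying
        current = enhance(current)
        result = run_ocr(current)
        attempts += 1
        if result.confidence > best.confidence:
            best = result
    return best, attempts
```

In a real pipeline, `run_ocr` would wrap an actual OCR engine and each enhancement might be a step such as deskewing, contrast adjustment, or despeckling. A truly adaptive system would also choose the next enhancement based on why the previous pass failed, which this sketch does not attempt.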

Digitization is often portrayed as "simply scanning and searching": a document is fed into a scanner and a searchable PDF emerges on the other side, ready for use. In reality, digitization is a highly specialized pipeline that combines the disciplines of digital imaging, data management, cataloging, user interface design, computation, data mining, and data preservation. I've worked extensively in all of the following areas:

In 2008 I authored the first technical comparison of the Google Books and Open Content Alliance book digitization programs, which contradicted many of the popular conceptions of the two services at the time and is widely cited in discussions of the two services and large-scale book digitization.