Research: Digitization
I have worked extensively in all areas of digitization, including founding and overseeing the highest-volume microform digitization center in academia and consulting widely for academic, governmental, and industrial digitization initiatives. I have extensive experience with OCR, combining both open source and commercial OCR algorithms and workflows into best-of-breed solutions. For example, one such workflow processed more than 100,000 pages per day per server and included automated correction steps that detected poor-quality OCR output, applied adaptive corrective image enhancement, and iteratively retried OCR. I was also technical lead on a component of a project with the United States National Archives and Records Administration (NARA) evaluating future infrastructure to support large-scale digital repositories, with archival and preservation as key focus areas.
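As a rough illustration of the detect, enhance, and retry loop described above, here is a minimal Python sketch. It is not the production workflow: the names `ocr_with_retries`, `run_ocr`, `enhancers`, and `min_confidence` are hypothetical, the OCR engine is assumed to report a confidence score between 0 and 1, and the enhancement steps are simply tried in a fixed order rather than chosen adaptively.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple, TypeVar

Image = TypeVar("Image")


@dataclass
class OcrResult:
    text: str
    confidence: float  # assumed range 0.0 to 1.0, as reported by the OCR engine
    attempts: int      # number of OCR passes actually run


def ocr_with_retries(
    image: Image,
    run_ocr: Callable[[Image], Tuple[str, float]],
    enhancers: Sequence[Callable[[Image], Image]],
    min_confidence: float = 0.85,  # hypothetical quality threshold
) -> OcrResult:
    """Run OCR, and while quality is poor, retry on enhanced copies of the page."""
    text, confidence = run_ocr(image)
    best = OcrResult(text, confidence, attempts=1)

    for enhance in enhancers:
        # Stop as soon as the best result so far is good enough.
        if best.confidence >= min_confidence:
            break
        # Each enhancement is applied to the original image, not stacked.
        text, confidence = run_ocr(enhance(image))
        attempts = best.attempts + 1
        if confidence > best.confidence:
            best = OcrResult(text, confidence, attempts)
        else:
            best = OcrResult(best.text, best.confidence, attempts)

    return best
```

In practice `run_ocr` would wrap whichever open source or commercial engine is in use, and each entry in `enhancers` would be an image operation such as deskewing or contrast adjustment. Keeping the best-scoring result, rather than the most recent one, means a failed enhancement never makes the output worse.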
Digitization is often portrayed as "simply scanning and searching": a document is fed into a scanner and a searchable PDF emerges on the other side, ready for use. In reality, digitization is a highly specialized pipeline that combines the disciplines of digital imaging, data management, cataloging, user interface design, computation, data mining, and data preservation. I've worked extensively in all of the following areas:
- Digital Imaging. Nearly all volume digitization initiatives rely on consumer digital cameras as their primary imaging unit. A deep understanding of digital imaging systems, optical assemblies, and discrete light sensing is necessary to develop and tune the most accurate digitization workflows.
- Image Processing. Extensive image processing occurs on digitized imagery both in the hardware of the camera or capture system and in later stages. Understanding the algorithms and hardware that underlie these processes makes it possible to control for their effects on the results.
- Optical Character Recognition (OCR). Once textual materials have been digitized, they must be converted from their raster imagery into computer-recognizable text.
- Metadata / Curation. Extensive metadata must be associated with all digital imagery, connecting it to information about provenance and content focus.
- Access / Data Mining / Search. Access metaphors for digital information must be heavily tailored to the intended user community of a given digital library or repository, matching that community's expectations, access patterns, and workflows.
- Archival / Preservation. File formats continually evolve, and repositories must be able to continually and automatically upgrade content to the latest formats to ensure continued availability, while archiving previous versions and mitigating "experience loss" from version to version.
In 2008 I authored the first technical comparison of the Google Books and Open Content Alliance book digitization programs, which contradicted many of the popular conceptions of the two services at the time and is widely cited in discussions of the two services and large-scale book digitization.
- Mass book digitization: The deeper story of Google Books and the Open Content Alliance. First Monday. Vol. 13, Issue 10. (October 6, 2008).
- Como Sera el Mundo del Libro en el 2020? Seis expertos predicen el futuro (What will the world of the book look like in 2020? Six experts predict the future). Interview in Que Leer, article by Alvaro Colomer. January 2009 issue, pp. 38-45. (Que Leer is the preeminent Spanish cultural magazine.)
- Cited in Library Quarterly: Conway, Paul. (January 2010). Preservation in the Age of Google: Digitization, Digital Preservation, and Dilemmas. Volume 80, Number 1. pp. 61-79.
- Cited in portal: Libraries and the Academy: Dougan, Kirstin. (2009). Music to our Eyes: Google Books, Google Scholar, and the Open Content Alliance. Volume 10, Number 1. pp. 75-93.
- Cited in 10th International Conference on Document Analysis and Recognition: Fan, Jian. Robust Color Image Enhancement of Digitized Books. (July 26-July 29, 2009, Barcelona, Spain).
- Extensively cited and covered in the popular and academic press.