The past ten years have seen a dramatic rise in digitization efforts in libraries. Despite this widespread interest, digitization is no small task; it requires considerable time and labor, and thus financial resources, because skilled work is involved at nearly every stage. Unfortunately, institutions are failing to realize the full potential of these investments, chiefly due to the interface design of digital collections, which usually feature keyword search as the primary discovery model.
Keyword search relies on both the user’s ability to know from the outset how to describe their query and the materials’ content or metadata matching that query exactly. However, materials in digital collections do not easily fit this model. Images do not have textual content, so a keyword search for an image depends entirely on its descriptive metadata. Even text-based materials fall short on this front because of the limitations of the optical character recognition (OCR) technologies that enable keyword searching. OCR accuracy rates can dip below 60% depending on the clarity of the image, the size of the font, the language of the text, and whether the text is handwritten. Currently, users have no clear understanding of what OCR technology is, how inconsistently it is applied across collections, or how that could affect their search results.
This paper will explore how the prominence of keyword search within digital collections, combined with the limitations of OCR, has failed users. It will survey the current OCR landscape, including its capabilities and limitations. It will also identify issues that should be directly communicated to users in order to increase information literacy. Finally, it will explore alternatives to keyword search in digital collections, with the ultimate goal of making digital collections more navigable and useful.