Program

Digital Humanities: Digital Humanities Implementation Grants

Period of Performance

9/1/2015 - 12/31/2017

Funding Totals

$215,830.00 (approved)
$215,591.34 (awarded)

Reading the First Books: Multilingual, Early-Modern OCR for Primeros Libros

FAIN: HK-230965-15

University of Texas at Austin (Austin, TX 78712-0100)
Sergio Romero (Project Director: February 2015 to June 2019)
Laura C. Mandell (Co Project Director: July 2015 to June 2019)

Enhancement of optical character recognition (OCR) technologies to improve researchers' ability to discover and search early modern, multilingual printed texts. During this phase, the project team would focus on books printed in the Americas before 1601.

Digital facsimile collections of early modern printed books (books printed on hand presses in the 15th-17th centuries) greatly improve access to these cultural heritage materials for scholars, students, and the general public. The utility and accessibility of these digital collections, however, have been limited by the challenges of transcribing early modern printed books: their linguistic complexity, unstable orthography (spelling and punctuation), and uneven typesetting and inking make these books difficult to read for humans and machines alike. The goal of this project is to develop and implement groundbreaking methods in the automatic transcription of early modern printed books. This will increase access to books that not only are a vital record of historical thought during this exciting period in European, colonial, and indigenous American history, but also reflect the development of a new, transformative technology: the printing press.

Associated Products

Reading the First Books (Public Lecture or Presentation)
Title: Reading the First Books
Abstract: New projects to scan early modern printed books have radically increased global access to valuable historical documents. Machine readers, however, are woefully unsuited to the uneven inking, anachronistic characters, unfamiliar typefaces, inconsistent orthographies, and multilinguality that characterize these historical documents. The “Reading the First Books” project addresses these challenges through the development and implementation of Ocular, a new digital tool for reading and automatically transcribing books from this period. The project focuses on tailoring the tool for reading the Primeros Libros collection of books printed in the Americas during the first century of Spanish colonization. This talk will introduce the practical and theoretical implications of this project for both librarians and scholars interested in colonial documents or cultural analytics. On a practical level, we will discuss how the integration of Ocular into the Early Modern OCR Project at Texas A&M will enable new transcription projects across institutions. From a theoretical position, we will consider how Ocular’s transcription process, which simultaneously analyzes patterns in inking, typography, language use, and orthography, opens new possibilities for academic research. Reading the First Books is a collaboration between LLILAS Benson Latin American Studies and Collections at the University of Texas at Austin, and the Initiative for Digital Humanities, Media, and Culture at Texas A&M University. It is funded by an NEH Digital Humanities Implementation Grant.
Author: Hannah Alpert-Abrams
Date: 11/18/2015
Location: John Carter Brown Library

The Electronic Edition of Colonial and Nineteenth-Century Latin American Texts: New Tools, New Models for Collaboration (Conference Paper/Presentation)
Title: The Electronic Edition of Colonial and Nineteenth-Century Latin American Texts: New Tools, New Models for Collaboration
Author: Hannah Alpert-Abrams
Abstract: This session brings together a group of experts for a conversation about new possibilities for digital research related to colonial and nineteenth-century Latin America. Hannah Alpert-Abrams of the University of Texas at Austin will speak on Ocular, an optical character recognition (OCR) tool that can read multilingual texts, including those involving indigenous languages. Nick Laiacona, founder of Performant Software Solutions, will discuss Juxta, a TEI-XML-based editing tool that provides an easy-to-use graphical interface and features for project management, including version control. Liz Grumbach, project manager for the Advanced Research Consortium and 18thConnect, will share her experiences creating communities to support the peer-review of electronic scholarship. Ralph Bauer of the University of Maryland will discuss the changes that are taking place at the Early Americas Digital Archive. This session has been designed as the starting point for what we hope will be an ongoing conversation about the Digital Humanities in our field.
Date: 05/28/2016
Conference Name: Latin American Studies Association

An Unsupervised Model of Orthographic Variation for Historical Document Transcription (Article)
Title: An Unsupervised Model of Orthographic Variation for Historical Document Transcription
Author: Dan Garrette and Hannah Alpert-Abrams
Abstract: Historical documents frequently exhibit extensive orthographic variation, including archaic spellings and obsolete shorthand. OCR tools typically seek to produce so-called diplomatic transcriptions that preserve these variants, but many end tasks require transcriptions with normalized orthography. In this paper, we present a novel joint transcription model that learns, unsupervised, a probabilistic mapping between modern orthography and that used in the document. Our system thus produces dual diplomatic and normalized transcriptions simultaneously, and achieves a 35% relative error reduction over a state-of-the-art OCR model on diplomatic transcription, and a 46% reduction on normalized transcription.
Year: 2016
Primary URL: http://naacl.org/naacl-hlt-2016/proceedings.html
Primary URL Description: Official site of publication.
Secondary URL: http://www.dhgarrette.com/papers/garrette_ocr_naacl2016.pdf
Secondary URL Description: PDF of article on personal website.
Access Model: Open Access
Format: Journal
Publisher: North American Chapter of the Association for Computational Linguistics
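
The abstract above reports its results as relative error reductions (35% and 46%). As a minimal sketch of how such a figure is usually computed, the Python snippet below derives a relative reduction from two error rates. The function name and the example error rates are hypothetical and chosen only for illustration; they are not taken from the article.

```python
def relative_error_reduction(baseline_error: float, system_error: float) -> float:
    """Return the fraction of the baseline error that the new system removes."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - system_error) / baseline_error


# Hypothetical error rates for illustration only (not figures from the article):
# a baseline model at 20% error and a new model at 13% error.
baseline = 0.20
system = 0.13
print(f"Relative error reduction: {relative_error_reduction(baseline, system):.0%}")
```

A relative reduction measures the share of the baseline's errors that are eliminated, so it differs from the absolute difference between the two rates (7 percentage points in this made-up case).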

Machine Reading the Primeros Libros (Article)
Title: Machine Reading the Primeros Libros
Author: Hannah Alpert-Abrams
Abstract: Early modern printed books pose particular challenges for automatic transcription: uneven inking, irregular orthographies, radically multilingual texts. As a result, modern efforts to transcribe these documents tend to produce the textual gibberish commonly known as "dirty OCR" (Optical Character Recognition). This noisy output is most frequently seen as a barrier to access for scholars interested in the computational analysis or digital display of transcribed documents. This article, however, proposes that a closer analysis of dirty OCR can reveal both historical and cultural factors at play in the practice of automatic transcription. To make this argument, it focuses on tools developed for the automatic transcription of the Primeros Libros collection of sixteenth-century Mexican printed books. By bringing together the history of the collection with that of the OCR tool, it illustrates how the colonial history of these documents is embedded in, and transformed by, the statistical models used for automatic transcription. It argues that automatic transcription, itself a mechanical and practical tool, also has an interpretive effect on transcribed texts that can have practical consequences for scholarly work.
Year: 2016
Primary URL: http://www.digitalhumanities.org/dhq/vol/10/4/000268/000268.html
Primary URL Description: Journal Website
Access Model: Open Access
Format: Journal
Periodical Title: Digital Humanities Quarterly