Program

Digital Humanities: Digital Humanities Advancement Grants

Period of Performance

1/1/2019 - 6/30/2021

Funding Totals

$100,000.00 (approved)
$99,223.60 (awarded)


Improving Optical Character Recognition and Tracking Reader Annotations in Printed Books by Collating and Transcribing Multiple Exemplars

FAIN: HAA-263837-19

Northeastern University (Boston, MA 02115-5005)
David Smith (Project Director: June 2018 to October 2022)

Further research on enhanced optical character recognition techniques for historical printed books and on the automatic discovery of handwritten marginalia, drawing on the collections of the Internet Archive.

Most past digitization projects have focused on transcribing documents individually. With the availability of library-scale digital collections, we propose a Digital Humanities Advancement Grant (Level II) to develop computational image and language models to discover multiple copies and editions of similar texts and to correct each text using these comparable witnesses. We provide evidence that this collational transcription system can significantly improve optical character recognition on historical books. We also propose to use these collated editions to discover annotated passages in large digitized book collections. This approach will therefore not only mitigate the errors that reader annotations introduce into the OCR process but will also produce the first automatically generated database of handwritten annotations, Ichneumon. Methods and software developed by this project will thus benefit future research on automatic collation, book history, and historical reading practices.
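As a minimal illustration of the collational idea described above (not the project's actual system), several noisy OCR readings of the same printed line can be combined by per-position majority vote once they are aligned. The witness strings below are invented for the example, and the alignment is assumed to have been done already:

```python
from collections import Counter

def majority_vote(witnesses):
    """Correct OCR output by per-character majority vote over transcripts
    of the same line. Assumes the witnesses have already been aligned
    (padded) to equal length."""
    assert len({len(w) for w in witnesses}) == 1, "witnesses must be aligned"
    return "".join(Counter(chars).most_common(1)[0][0]
                   for chars in zip(*witnesses))

# Three noisy readings of the same line, each with a different error:
witnesses = [
    "Tne quick brown fox",
    "The qu1ck brown fox",
    "The quick brOwn fox",
]
print(majority_vote(witnesses))  # → The quick brown fox
```

In practice the hard part is the alignment itself, since witnesses differ in length and layout; the sketch only shows why multiple comparable witnesses let errors in any single copy be outvoted.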





Associated Products

Alignment-Based Training for Detecting Reader Annotations in Printed Books (Conference Paper/Presentation)
Title: Alignment-Based Training for Detecting Reader Annotations in Printed Books
Author: Soumya Mohanty
Author: David A. Smith
Abstract: Digitized books preserve not only the printed text on the page but also the marks of readers' past engagement. These reader annotations include extra text such as handwritten notes or corrections, underlines, highlights, brackets, delimiters, etc. Building a model that could automatically detect and localize these annotations can allow us to trace which passages interested readers and what additions, if any, they made to the text. ... [exceeds 2000 characters]
Date: 5/8/2019
Primary URL: http://datech.digitisation.eu/programme/schedule/
Primary URL Description: conference website
Conference Name: Digital Access to Textual Cultural Heritage (DATeCH)

Detecting de minimis Code-Switching in Historical German Books (Conference Paper/Presentation)
Title: Detecting de minimis Code-Switching in Historical German Books
Author: Shijia Liu
Author: David A. Smith
Abstract: Code-switching has long interested linguists, with computational work in particular focusing on speech and social media data (Sitaram et al., 2019). This paper contrasts these informal instances of code-switching to its appearance in more formal registers, by examining the mixture of languages in the Deutsches Textarchiv (DTA), a corpus of 1406 primarily German books from the 17th to 19th centuries. We automatically annotate and manually inspect spans of six embedded languages (Latin, French, English, Italian, Spanish, and Greek) in the corpus. We quantitatively analyze the differences between code-switching patterns in these books and those in more typically studied speech and social media corpora. Furthermore, we address the practical task of predicting code-switching from features of the matrix language alone in the DTA corpus. Such classifiers can help reduce errors when optical character recognition or speech transcription is applied to a large corpus with rare embedded languages.
Date: 12/8/2020
Primary URL: https://www.aclweb.org/anthology/2020.coling-main.163/
Primary URL Description: Archival version of the paper in the ACL Anthology.
Conference Name: International Conference on Computational Linguistics (COLING)

Content-based models of quotation (Conference Paper/Presentation)
Title: Content-based models of quotation
Author: Ansel MacLaughlin
Author: David A. Smith
Abstract: We explore the task of quotability identification, in which, given a document, we aim to identify which of its passages are the most quotable, i.e., the most likely to be directly quoted by later derived documents. We approach quotability identification as a passage ranking problem and evaluate how well both feature-based and BERT-based (Devlin et al., 2019) models rank the passages in a given document by their predicted quotability. We explore this problem through evaluations on five datasets that span multiple languages (English, Latin) and genres of literature (e.g. poetry, plays, novels) and whose corresponding derived documents are of multiple types (news, journal articles). Our experiments confirm the relatively strong performance of BERT-based models on this task, with the best model, a RoBERTa sequential sentence tagger, achieving an average ρ of 0.35 and NDCG@1, 5, 50 of 0.26, 0.31 and 0.40, respectively, across all five datasets.
Date: 4/19/2021
Primary URL: http://ccs.neu.edu/home/dasmith/maclaughlin-eacl-2021.pdf
Primary URL Description: Preprint
Conference Name: Conference of the European Chapter of the Association for Computational Linguistics (EACL)
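The NDCG@k figures reported in the abstract above follow the standard definition of normalized discounted cumulative gain. As a reference point only (this is not the authors' evaluation code), the metric can be computed for a ranked list of graded relevance judgments as:

```python
import math

def ndcg_at_k(ranked_relevance, k):
    """Normalized discounted cumulative gain at cutoff k, using the
    standard DCG form rel_i / log2(i + 1) for 1-indexed ranks."""
    def dcg(scores):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(scores))
    ideal = dcg(sorted(ranked_relevance, reverse=True)[:k])
    return dcg(ranked_relevance[:k]) / ideal if ideal > 0 else 0.0

# A perfect ranking scores 1.0; swapping the top two items lowers NDCG@2.
print(ndcg_at_k([1, 0], 2))             # → 1.0
print(round(ndcg_at_k([0, 1], 2), 3))   # → 0.631
```

The normalization by the ideal ordering is what makes scores comparable across documents with different numbers of quotable passages.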

Digital Editions as Distant Supervision for Layout Analysis of Printed Books (Conference Paper/Presentation)
Title: Digital Editions as Distant Supervision for Layout Analysis of Printed Books
Author: Alejandro H. Toselli
Author: Si Wu
Author: David A. Smith
Abstract: Archivists, textual scholars, and historians often produce digital editions of historical documents. Using markup schemes such as those of the Text Encoding Initiative and EpiDoc, these digital editions often record documents’ semantic regions (such as notes and figures) and physical features (such as page and line breaks) as well as transcribing their textual content. We describe methods for exploiting this semantic markup as distant supervision for training and evaluating layout analysis models. In experiments with several model architectures on the half-million pages of the Deutsches Textarchiv (DTA), we find a high correlation of these region-level evaluation methods with pixel-level and word-level metrics. We discuss the possibilities for improving accuracy with self-training and the ability of models trained on the DTA to generalize to other historical printed books.
Date: 9/8/2021
Conference Name: International Conference on Document Analysis and Recognition
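The distant-supervision idea in the abstract above can be sketched in miniature: semantic markup in a digital edition already names the regions a layout model should predict. The TEI snippet and the label mapping below are invented for illustration; the paper's actual label set and pipeline differ:

```python
import xml.etree.ElementTree as ET

# A toy TEI-style fragment; real DTA editions are far richer.
TEI = """<text>
  <body>
    <p>Main printed text of the page.</p>
    <note place="margin">A marginal note.</note>
    <figure><figDesc>Woodcut illustration.</figDesc></figure>
  </body>
</text>"""

def region_labels(tei_xml):
    """Map marked-up elements to coarse layout-region labels, turning
    edition markup into (noisy) training labels. The element names are
    TEI; this particular mapping is illustrative only."""
    label_map = {"p": "body-text", "note": "marginalia", "figure": "figure"}
    root = ET.fromstring(tei_xml)
    return [(el.tag, label_map[el.tag])
            for el in root.iter() if el.tag in label_map]

print(region_labels(TEI))
# → [('p', 'body-text'), ('note', 'marginalia'), ('figure', 'figure')]
```

Linking these element-level labels back to pixel regions on the page image is the substantive step the paper addresses; the sketch only shows where the free supervision comes from.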