HAA-263837-19 | Digital Humanities: Digital Humanities Advancement Grants | Northeastern University | Improving Optical Character Recognition and Tracking Reader Annotations in Printed Books by Collating and Transcribing Multiple Exemplars | 1/1/2019 - 6/30/2021 | $100,000.00 | David | | Smith | | | | Northeastern University | Boston | MA | 02115-5005 | USA | 2018 | Computational Linguistics | Digital Humanities Advancement Grants | Digital Humanities | 100000 | 0 | 99223.6 | 0 | Further research in enhanced optical character recognition techniques for historical print books and automatic discoverability of handwritten marginalia drawing upon the collections of the Internet Archive.
Most past digitization projects have focused on transcribing documents individually. With the availability of library-scale digital collections, we propose a Digital Humanities Advancement Grant (Level II) project to develop computational image and language models that discover multiple copies and editions of similar texts and correct each text using these comparable witnesses. We provide evidence that this collational transcription system can significantly improve optical character recognition on historical books. We also propose to use these collated editions to discover annotated passages in large digitized book collections. This approach will therefore not only mitigate the errors that reader annotations introduce into the OCR process but will also produce the first automatically generated database of handwritten annotations, Ichneumon. Methods and software developed by this project will thus benefit future research on automatic collation, book history, and historical reading practices. |