Program

Digital Humanities: Digital Humanities Advancement Grants

Period of Performance

1/1/2021 - 6/30/2023

Funding Totals

$324,571.00 (approved)
$282,905.00 (awarded)


Automatic Collation for Diversifying Corpora: Improving Handwritten Text Recognition (HTR) for Arabic-script Manuscripts

FAIN: HAA-277203-21

University of Maryland, College Park (College Park, MD 20742-5141)
Matthew Thomas Miller (Project Director: June 2020 to present)
David Smith (Co-Project Director: November 2020 to present)

Refinement of machine learning methods to improve automatic handwritten text recognition of Persian and Arabic manuscripts and make these sources more accessible for humanities research and teaching.

The Automatic Collation for Diversifying Corpora (ACDC) project will significantly improve the accuracy of handwritten text recognition (HTR) for Arabic-script manuscripts by developing a collation tool that automatically creates large amounts of training data from existing digital texts and manuscript images, without time-consuming human annotation of individual manuscripts. To accomplish this, the project will extend the text alignment tool passim and the HTR engine Kraken so that very poor initial HTR transcriptions of diverse manuscript exemplars can be aligned with existing digital texts, automatically producing training data in a “distantly supervised” manner. By accelerating the production of training data, the ACDC tool will enable, for the first time, the creation of the generalizable Arabic and Persian HTR models required for the digital transcription of large-scale Persian and Arabic manuscript collections.
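
The alignment idea described above can be illustrated with a minimal sketch. It is not the project's implementation and uses neither passim nor Kraken: Python's standard-library difflib stands in for the aligner, and the function name best_match, the threshold min_ratio, and the sample strings are all hypothetical. Given one noisy HTR line and an existing digital text, it searches for the most similar span of the reference and keeps it as a candidate training label only if the similarity is high enough.

```python
from difflib import SequenceMatcher


def best_match(noisy_line, reference, min_ratio=0.6):
    """Find the reference span most similar to a noisy HTR line.

    Returns (ratio, span) or None when nothing reaches min_ratio.
    The threshold is an illustrative assumption, not a project setting.
    """
    words = reference.split()
    n = max(1, len(noisy_line.split()))
    best = (0.0, "")
    # Try windows of roughly the same word count as the noisy line.
    for size in (n - 1, n, n + 1):
        if size < 1:
            continue
        for start in range(len(words) - size + 1):
            window = " ".join(words[start:start + size])
            ratio = SequenceMatcher(None, noisy_line, window,
                                    autojunk=False).ratio()
            if ratio > best[0]:
                best = (ratio, window)
    return best if best[0] >= min_ratio else None


# Hypothetical data: English placeholders stand in for Arabic-script text.
reference_text = ("in the name of god the merciful the compassionate "
                  "praise be to god lord of the worlds")
noisy_htr_line = "prajse be to g0d lrd of"

match = best_match(noisy_htr_line, reference_text)
if match is not None:
    ratio, clean_span = match
    # The line image would be paired with clean_span as a training label.
    print(f"{ratio:.2f}  {clean_span}")
```

The comparison is character-based, so the same sketch would apply unchanged to Arabic-script strings. A real pipeline would need an aligner that scales to whole corpora, which is the role the abstract assigns to passim.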