Program

Preservation and Access: Research and Development

Period of Performance

3/1/2021 - 2/28/2025

Funding Totals

$349,677.00 (approved)
$349,283.00 (awarded)


Unlocking Endangered Language Resources

FAIN: PR-276810-21

George Mason University (Fairfax, VA 22030-4444)
Antonios Anastasopoulos (Project Director: May 2020 to present)

The development of modern Optical Character Recognition and post-correction tools tailored for Indigenous Latin American languages through a multilingual benchmark, software package, web interface, and digitized data to be returned to the Archive of the Indigenous Languages of Latin America (AILLA).

This project will unlock endangered and low-resource language data that have already been collected and are stored in linguistic archives such as the Archive of the Indigenous Languages of Latin America (AILLA). To do so, we will combine modern machine learning tools with linguistic expertise to develop Optical Character Recognition and post-correction tools tailored to the intricacies of these language data. The result will include a multilingual benchmark, a software package, a web interface, and digitized data that will be returned to AILLA for storage.





Associated Products

Lexically Aware Semi-Supervised Learning for OCR Post-Correction (Article)
Title: Lexically Aware Semi-Supervised Learning for OCR Post-Correction
Author: Shruti Rijhwani
Author: Daisy Rosenblum
Author: Antonios Anastasopoulos
Author: Graham Neubig
Abstract: Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents. Optical character recognition (OCR) can be used to produce digitized text, and previous work has demonstrated the utility of neural post-correction methods that improve the results of general-purpose OCR systems on recognition of less-well-resourced languages. However, these methods rely on manually curated post-correction data, which are relatively scarce compared to the non-annotated raw images that need to be digitized. In this paper, we present a semi-supervised learning method that makes it possible to utilize these raw images to improve performance, specifically through the use of self-training, a technique where a model is iteratively trained on its own outputs. In addition, to enforce consistency in the recognized vocabulary, we introduce a lexically aware decoding method that augments the neural post-correction model with a count-based language model constructed from the recognized texts, implemented using weighted finite-state automata (WFSA) for efficient and effective decoding. Results on four endangered languages demonstrate the utility of the proposed method, with relative error reductions of 15%–29%, where we find the combination of self-training and lexically aware decoding essential for achieving consistent improvements.
Year: 2021
Primary URL: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00427/108475/Lexically-Aware-Semi-Supervised-Learning-for-OCR
Access Model: open access
Format: Journal
Periodical Title: Transactions of the Association for Computational Linguistics
Publisher: Transactions of the Association for Computational Linguistics 2021

Explorations in Transfer Learning for OCR Post-Correction (Conference Paper/Presentation)
Title: Explorations in Transfer Learning for OCR Post-Correction
Author: Lindia Tjuatja
Author: Shruti Rijhwani
Author: Graham Neubig
Abstract: In this abstract, we explore transfer learning to improve post-correction for optical character recognition (OCR), specifically for documents that contain endangered language texts. We extend an existing OCR post-correction model (Rijhwani et al., 2020) by introducing an additional pretraining step on related data, such as text in a related language or available target endangered language datasets that may differ in orthography. Although cross-lingual transfer is often successful in high-resource settings, our preliminary results show that transferring from data in another language decreases performance for this task. On the other hand, we observe small improvements in performance when transferring from additional target language data.
Date: 11/10/2021
Primary URL: http://www.winlp.org/wp-content/uploads/2021/11/winlp2021_42_Paper.pdf
Conference Name: 5th Widening NLP Workshop

Noisy Parallel Data Alignment (Conference Paper/Presentation)
Title: Noisy Parallel Data Alignment
Author: Ruoyu Xie
Author: Antonios Anastasopoulos
Abstract: An ongoing challenge in current natural language processing is how its major advancements tend to disproportionately favor resource-rich languages, leaving a significant number of under-resourced languages behind. Due to the lack of resources required to train and evaluate models, most modern language technologies are either nonexistent or unreliable to process endangered, local, and non-standardized languages. Optical character recognition (OCR) is often used to convert endangered language documents into machine-readable data. However, such OCR output is typically noisy, and most word alignment models are not built to work under such noisy conditions. In this work, we study the existing word-level alignment models under noisy settings and aim to make them more robust to noisy data. Our noise simulation and structural biasing method, tested on multiple language pairs, manages to reduce the alignment error rate on a state-of-the-art neural-based alignment model up to 59.6%.
Date: 1/5/2023
Primary URL: https://aclanthology.org/2023.findings-eacl.111/
Primary URL Description: link to ACL Anthology
Conference Name: Findings of European Association for Computational Linguistics (EACL) 2023

PALI: A Language Identification Benchmark for Perso-Arabic Scripts (Conference Paper/Presentation)
Title: PALI: A Language Identification Benchmark for Perso-Arabic Scripts
Author: Sina Ahmadi
Author: Milind Agarwal
Author: Antonios Anastasopoulos
Abstract: The Perso-Arabic scripts are a family of scripts that are widely adopted and used by various linguistic communities around the globe. Identifying various languages using such scripts is crucial to language technologies and challenging in low-resource setups. As such, this paper sheds light on the challenges of detecting languages using Perso-Arabic scripts, especially in bilingual communities where “unconventional” writing is practiced. To address this, we use a set of supervised techniques to classify sentences into their languages. Building on these, we also propose a hierarchical model that targets clusters of languages that are more often confused by the classifiers. Our experiment results indicate the effectiveness of our solutions.
Date: 1/5/2023
Primary URL: https://aclanthology.org/2023.vardial-1.8/
Primary URL Description: ACL Anthology Entry
Conference Name: Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)

User-Centric Evaluation of OCR Systems for Kwak’wala (Conference Paper/Presentation)
Title: User-Centric Evaluation of OCR Systems for Kwak’wala
Author: Shruti Rijhwani
Author: Daisy Rosenblum
Author: Michayla King
Author: Antonios Anastasopoulos
Author: Graham Neubig
Abstract: There has been recent interest in improving optical character recognition (OCR) for endangered languages, particularly because a large number of documents and books in these languages are not in machine-readable formats. The performance of OCR systems is typically evaluated using automatic metrics such as character and word error rates. While error rates are useful for the comparison of different models and systems, they do not measure whether and how the transcriptions produced from OCR tools are useful to downstream users. In this paper, we present a human-centric evaluation of OCR systems, focusing on the Kwak’wala language as a case study. With a user study, we show that utilizing OCR reduces the time spent in the manual transcription of culturally valuable documents – a task that is often undertaken by endangered language community members and researchers – by over 50%. Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
Date: 1/3/2023
Primary URL: https://aclanthology.org/2023.computel-1.4/
Primary URL Description: ACL Anthology Link
Conference Name: Proceedings of the Sixth Workshop on the Use of Computational Methods in the Study of Endangered Languages (ComputEL-6)