Program

Preservation and Access: Research and Development

Period of Performance

3/1/2021 - 2/28/2025

Funding Totals

$349,677.00 (approved)
$349,283.00 (awarded)


Unlocking Endangered Language Resources

FAIN: PR-276810-21

George Mason University (Fairfax, VA 22030-4444)
Antonios Anastasopoulos (Project Director: May 2020 to present)

The development of modern Optical Character Recognition and post-correction tools tailored for Indigenous Latin American languages through a multilingual benchmark, software package, web interface, and digitized data to be returned to the Archive of the Indigenous Languages of Latin America (AILLA).

This project will unlock endangered and low-resource language data that have already been collected and are stored in linguistic archives such as the Archive of the Indigenous Languages of Latin America (AILLA). To do so, we will combine machine learning tools with linguistic expertise to develop modern Optical Character Recognition and post-correction tools tailored to the intricacies of these language data. The result will include a multilingual benchmark, a software package, a web interface, and digitized data that will be returned to AILLA for storage.





Associated Products

Lexically Aware Semi-Supervised Learning for OCR Post-Correction (Article)
Title: Lexically Aware Semi-Supervised Learning for OCR Post-Correction
Author: Shruti Rijhwani
Author: Daisy Rosenblum
Author: Antonios Anastasopoulos
Author: Graham Neubig
Abstract: Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents. Optical character recognition (OCR) can be used to produce digitized text, and previous work has demonstrated the utility of neural post-correction methods that improve the results of general-purpose OCR systems on recognition of less-well-resourced languages. However, these methods rely on manually curated post-correction data, which are relatively scarce compared to the non-annotated raw images that need to be digitized. In this paper, we present a semi-supervised learning method that makes it possible to utilize these raw images to improve performance, specifically through the use of self-training, a technique where a model is iteratively trained on its own outputs. In addition, to enforce consistency in the recognized vocabulary, we introduce a lexically aware decoding method that augments the neural post-correction model with a count-based language model constructed from the recognized texts, implemented using weighted finite-state automata (WFSA) for efficient and effective decoding. Results on four endangered languages demonstrate the utility of the proposed method, with relative error reductions of 15%–29%, where we find the combination of self-training and lexically aware decoding essential for achieving consistent improvements.
Year: 2021
Primary URL: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00427/108475/Lexically-Aware-Semi-Supervised-Learning-for-OCR
Access Model: open access
Format: Journal
Periodical Title: Transactions of the Association for Computational Linguistics
Publisher: Transactions of the Association for Computational Linguistics 2021

Explorations in Transfer Learning for OCR Post-Correction (Conference Paper/Presentation)
Title: Explorations in Transfer Learning for OCR Post-Correction
Author: Lindia Tjuatja
Author: Shruti Rijhwani
Author: Graham Neubig
Abstract: In this abstract, we explore transfer learning to improve post-correction for optical character recognition (OCR), specifically for documents that contain endangered language texts. We extend an existing OCR post-correction model (Rijhwani et al., 2020) by introducing an additional pretraining step on related data, such as text in a related language or available target endangered language datasets that may differ in orthography. Although cross-lingual transfer is often successful in high-resource settings, our preliminary results show that transferring from data in another language decreases performance for this task. On the other hand, we observe small improvements in performance when transferring from additional target language data.
Date: 11/10/2021
Primary URL: http://www.winlp.org/wp-content/uploads/2021/11/winlp2021_42_Paper.pdf
Conference Name: 5th Widening NLP Workshop

Noisy Parallel Data Alignment (Conference Paper/Presentation)
Title: Noisy Parallel Data Alignment
Author: Ruoyu Xie
Author: Antonios Anastasopoulos
Abstract: An ongoing challenge in current natural language processing is how its major advancements tend to disproportionately favor resource-rich languages, leaving a significant number of under-resourced languages behind. Due to the lack of resources required to train and evaluate models, most modern language technologies are either nonexistent or unreliable to process endangered, local, and non-standardized languages. Optical character recognition (OCR) is often used to convert endangered language documents into machine-readable data. However, such OCR output is typically noisy, and most word alignment models are not built to work under such noisy conditions. In this work, we study the existing word-level alignment models under noisy settings and aim to make them more robust to noisy data. Our noise simulation and structural biasing method, tested on multiple language pairs, manages to reduce the alignment error rate on a state-of-the-art neural-based alignment model up to 59.6%.
Date: 1/5/2023
Primary URL: https://aclanthology.org/2023.findings-eacl.111/
Primary URL Description: link to ACL Anthology
Conference Name: Findings of European Association for Computational Linguistics (EACL) 2023

PALI: A Language Identification Benchmark for Perso-Arabic Scripts (Conference Paper/Presentation)
Title: PALI: A Language Identification Benchmark for Perso-Arabic Scripts
Author: Sina Ahmadi
Author: Milind Agarwal
Author: Antonios Anastasopoulos
Abstract: The Perso-Arabic scripts are a family of scripts that are widely adopted and used by various linguistic communities around the globe. Identifying various languages using such scripts is crucial to language technologies and challenging in low-resource setups. As such, this paper sheds light on the challenges of detecting languages using Perso-Arabic scripts, especially in bilingual communities where “unconventional” writing is practiced. To address this, we use a set of supervised techniques to classify sentences into their languages. Building on these, we also propose a hierarchical model that targets clusters of languages that are more often confused by the classifiers. Our experiment results indicate the effectiveness of our solutions.
Date: 1/5/2023
Primary URL: https://aclanthology.org/2023.vardial-1.8/
Primary URL Description: ACL Anthology Entry
Conference Name: Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)

User-Centric Evaluation of OCR Systems for Kwak’wala (Conference Paper/Presentation)
Title: User-Centric Evaluation of OCR Systems for Kwak’wala
Author: Shruti Rijhwani
Author: Daisy Rosenblum
Author: Michayla King
Author: Antonios Anastasopoulos
Author: Graham Neubig
Abstract: There has been recent interest in improving optical character recognition (OCR) for endangered languages, particularly because a large number of documents and books in these languages are not in machine-readable formats. The performance of OCR systems is typically evaluated using automatic metrics such as character and word error rates. While error rates are useful for the comparison of different models and systems, they do not measure whether and how the transcriptions produced from OCR tools are useful to downstream users. In this paper, we present a human-centric evaluation of OCR systems, focusing on the Kwak’wala language as a case study. With a user study, we show that utilizing OCR reduces the time spent in the manual transcription of culturally valuable documents – a task that is often undertaken by endangered language community members and researchers – by over 50%. Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
Date: 1/3/2023
Primary URL: https://aclanthology.org/2023.computel-1.4/
Primary URL Description: ACL Anthology Link
Conference Name: Proceedings of the Sixth Workshop on the Use of Computational Methods in the Study of Endangered Languages (ComputEL-6)

A Concise Survey of OCR for Low-Resource Languages (Conference Paper/Presentation)
Title: A Concise Survey of OCR for Low-Resource Languages
Author: Milind Agarwal
Author: Antonios Anastasopoulos
Abstract: Modern natural language processing (NLP) techniques increasingly require substantial amounts of data to train robust algorithms. Building such technologies for low-resource languages requires focusing on data creation efforts and data-efficient algorithms. For a large number of low-resource languages, especially Indigenous languages of the Americas, this data exists in image-based non-machine-readable documents. This includes scanned copies of comprehensive dictionaries, linguistic field notes, children’s stories, and other textual material. To digitize these resources, Optical Character Recognition (OCR) has played a major role but it comes with certain challenges in low-resource settings. In this paper, we share the first survey of OCR techniques specific to low-resource data creation settings and outline several open challenges, with a special focus on Indigenous Languages of the Americas. Based on experiences and results from previous research, we conclude with recommendations on utilizing and improving OCR for the benefit of computational researchers, linguists, and language communities.
Date: 06/15/2024
Primary URL: https://aclanthology.org/2024.americasnlp-1.10.pdf
Primary URL Description: ACL Anthology link
Conference Name: 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)

Script-Agnosticism and its Impact on Language Identification for Dravidian Languages (Conference Paper/Presentation)
Title: Script-Agnosticism and its Impact on Language Identification for Dravidian Languages
Author: Milind Agarwal
Author: Joshua Otten
Author: Antonios Anastasopoulos
Abstract: Language identification is used as the first step in many data collection and crawling efforts because it allows us to sort online text into language-specific buckets. However, many modern languages, such as Konkani, Kashmiri, Punjabi etc., are synchronically written in several scripts. Moreover, languages with different writing systems do not share significant lexical, semantic, and syntactic properties in neural representation spaces, which is a disadvantage for closely related languages and low-resource languages, especially those from the Indian Subcontinent. To counter this, we propose learning script-agnostic representations using several different experimental strategies (upscaling, flattening, and script mixing) focusing on four major Dravidian languages (Tamil, Telugu, Kannada, and Malayalam). We find that word-level script randomization and exposure to a language written in multiple scripts is extremely valuable for downstream script-agnostic language identification, while also maintaining competitive performance on naturally occurring text.
Date: 04/15/2025
Primary URL: https://aclanthology.org/2025.naacl-long.377.pdf
Primary URL Description: ACL Anthology Link
Conference Name: 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies

LIMIT: Language Identification, Misidentification, and Translation using Hierarchical Models in 350+ Languages (Conference Paper/Presentation)
Title: LIMIT: Language Identification, Misidentification, and Translation using Hierarchical Models in 350+ Languages
Author: Milind Agarwal
Author: Md Mahfuz Ibn Alam
Author: Antonios Anastasopoulos
Abstract: Knowing the language of an input text/audio is a necessary first step for using almost every NLP tool such as taggers, parsers, or translation systems. Language identification is a well-studied problem, sometimes even considered solved; in reality, due to lack of data and computational challenges, current systems cannot accurately identify most of the world’s 7000 languages. To tackle this bottleneck, we first compile a corpus, MCS-350, of 50K multilingual and parallel children’s stories in 350+ languages. MCS-350 can serve as a benchmark for language identification of short texts and for 1400+ new translation directions in low-resource Indian and African languages. Second, we propose a novel misprediction-resolution hierarchical model, LIMIT, for language identification that reduces error by 55% (from 0.71 to 0.32) on our compiled children’s stories dataset and by 40% (from 0.23 to 0.14) on the FLORES-200 benchmark. Our method can expand language identification coverage into low-resource languages by relying solely on systemic misprediction patterns, bypassing the need to retrain large models from scratch.
Date: 12/15/2023
Primary URL: https://aclanthology.org/2023.emnlp-main.895.pdf
Primary URL Description: ACL Anthology Link
Conference Name: 2023 Conference on Empirical Methods in Natural Language Processing

Developing a Mixed-Methods Pipeline for Community-Oriented Digitization of Kwak’wala Legacy Texts (Conference Paper/Presentation)
Title: Developing a Mixed-Methods Pipeline for Community-Oriented Digitization of Kwak’wala Legacy Texts
Author: Milind Agarwal
Author: Daisy Rosenblum
Author: Antonios Anastasopoulos
Abstract: Kwak’wala is an Indigenous language spoken in British Columbia, with a rich legacy of published documentation spanning more than a century, and an active community of speakers, teachers, and learners engaged in language revitalization. Over 11 volumes of the earliest texts created during the collaboration between Franz Boas and George Hunt have been scanned but remain unreadable by machines. Complete digitization through optical character recognition has the potential to facilitate transliteration into modern orthographies and the creation of other language technologies. In this paper, we apply the latest OCR techniques to a series of Kwak’wala texts only accessible as images, and discuss the challenges and unique adaptations necessary to make such technologies work for these real-world texts. Building on previous methods, we propose using a mix of off-the-shelf OCR methods, language identification, and masking to effectively isolate Kwak’wala text, along with post-correction models, to produce a final high-quality transcription.
Date: 02/01/2025
Primary URL: https://computel-workshop.org/wp-content/uploads/2025/05/CEL-8_Proceedings.pdf
Primary URL Description: Workshop Proceedings
Conference Name: Eighth Workshop on the Use of Computational Methods in the Study of Endangered Languages

AILLA-OCR: A First Textual and Structural Post-OCR Dataset for 8 Indigenous Languages of Latin America (Conference Paper/Presentation)
Title: AILLA-OCR: A First Textual and Structural Post-OCR Dataset for 8 Indigenous Languages of Latin America
Author: Milind Agarwal
Author: Antonios Anastasopoulos
Abstract: It is by now common knowledge in the NLP community that low-resource languages need large-scale data creation efforts and novel contributions in the form of robust algorithms that work in data-scarce settings. Amongst these languages, however, many have a large amount of data, ripe for NLP applications, except that this data exists in image-based formats. This includes scanned copies of extremely valuable dictionaries, linguistic field notes, children’s stories, plays, and other textual material. To extract the text data from these non-machine-readable images, Optical Character Recognition (OCR) is the most popular technique, but it has proven to be challenging for low-resource languages because of their unique properties (uncommon diacritics, rare words etc.) and due to a general lack of preserved page-structure in the OCR output. So, to contribute to the reduction of these two big bottlenecks (lack of text data and layout quality), we release the first textual and structural OCR dataset for 8 indigenous languages of Latin America. We hope that our dataset will encourage researchers within the NLP and Computational Linguistics communities to work with these languages.
Date: 03/01/2025
Primary URL: https://computel-workshop.org/wp-content/uploads/2025/05/CEL-8_Proceedings.pdf
Primary URL Description: Conference Proceedings
Conference Name: Eighth Workshop on the Use of Computational Methods in the Study of Endangered Languages