Program

Preservation and Access: Research and Development

Period of Performance

1/1/2019 - 12/31/2020

Funding Totals

$75,000.00 (approved)
$71,161.38 (awarded)


Development of Image-to-text Conversion for Pashto and Traditional Chinese

FAIN: PR-263939-19

Arizona Board of Regents (Tucson, AZ 85721-0073)
Marek R. Rychlik (Project Director: June 2018 to May 2022)

The development of optical character recognition (OCR) technology and a software prototype for an open-source global language and culture databank for Pashto and Traditional Chinese.

The proposed NEH Research and Development Tier 1 project will provide a foundation for a large-scale, open-source, global language and culture databank for Pashto and Traditional Chinese. The Tier 1 activities include fundamental research, building a software prototype, and formulating a plan for Tier 2. The most important outcome of the Tier 1 phase will be software implementing new optical character recognition (OCR) technology for the two languages. The expected outcome of the entire project will be improved access to, and preservation of, documents in Pashto and Traditional Chinese, which collectively represent the cultural heritage of hundreds of millions of people; this will have a major impact on research in the humanities.





Associated Products

Multi-lingual Optical Character Recognition Seminar (Conference/Institute/Seminar)
Title: Multi-lingual Optical Character Recognition Seminar
Author: Marek Rychlik
Author: Yan Han
Abstract: This seminar is devoted to current OCR research and to the development of the "Worldly OCR" software. It is open to external speakers, and we are set up for Zoom presentations. Volunteers to give a presentation are welcome.
Date Range: 2019, 2020
Primary URL: http://alamos.math.arizona.edu/ocr

worldly-ocr (Web Resource)
Title: worldly-ocr
Author: Marek Rychlik
Author: Sayyed Vazirizade
Author: Yan Han
Author: Dylan Murphy
Author: Dwight Nweigwe
Abstract: Data and MATLAB code for a new OCR system
Year: 2018
Primary URL: https://github.com/mrychlik/worldly-ocr
Primary URL Description: The website contains the data and MATLAB code produced by the project and will be the primary dissemination site for the software products resulting from the project.

Development of a New Image-to-text Conversion System for Pashto, Farsi and Traditional Chinese (Web Resource)
Title: Development of a New Image-to-text Conversion System for Pashto, Farsi and Traditional Chinese
Author: Marek Rychlik
Author: Dwight Nweigwe
Author: Yan Han
Author: Dylan Murphy
Abstract: We report upon the results of a research and prototype building project, Worldly OCR, dedicated to developing new, more accurate image-to-text conversion software for several languages and writing systems. These include the cursive scripts Farsi and Pashto, and Latin cursive scripts. We also describe approaches geared towards Traditional Chinese, which is non-cursive, but features an extremely large character set of 65,000 characters. Our methodology is based on Machine Learning, especially Deep Learning, and Data Science, and is directed towards vast quantities of original documents, exceeding a billion pages. The target audience of this paper is a general audience with interest in Digital Humanities or in retrieval of accurate full-text and metadata from digital images.
Year: 2020
Primary URL: https://arxiv.org/abs/2005.08650
Primary URL Description: A version of the "white paper" hosted on arXiv, the widely known Cornell e-print server.

Marek Rychlik's YouTube channel (Web Resource)
Title: Marek Rychlik's YouTube channel
Author: Marek Rychlik
Abstract: The channel features approximately 20 videos, generated with the software produced by the project, that visualize the algorithms the project developed.
Year: 2019
Primary URL: https://www.youtube.com/channel/UCcq2ciH_Eb0rDckJmS_p-XQ
Primary URL Description: The videos are a mix of highly technical and easy-to-understand visualizations, illustrating character recognition, the method of outlines, etc. One video, entitled "Image-to-text conversion for Farsi", illustrates the operation of a full implementation of the OCR pipeline on Farsi text (used as a proxy for Pashto because, at the time of creation, we did not yet have the Pashto training data). The video demonstrates 97-98% character-level accuracy.