Development of Image-to-text Conversion for Pashto and Traditional Chinese
FAIN: PR-263939-19
Arizona Board of Regents (Tucson, AZ 85721-0073)
Marek R. Rychlik (Project Director: June 2018 to May 2022)
The development of optical character recognition
(OCR) technology and a software prototype for an open-source global language
and culture databank for Pashto and Traditional Chinese.
The proposed NEH Research and Development Tier 1 project will provide a foundation for a large-scale, open-source, global language and culture databank for Pashto and Traditional Chinese. Tier 1 activities include conducting fundamental research, building a software prototype, and formulating a plan for Tier 2. The most important outcome of the Tier 1 phase will be software implementing new optical character recognition (OCR) technology for the two languages. The expected outcome of the entire project will be improved access to and preservation of documents in Pashto and Traditional Chinese, collectively representing the cultural heritage of hundreds of millions of people, which will have a major impact on research in the humanities.
Associated Products
Multi-lingual Optical Character Recognition Seminar (Conference/Institute/Seminar)
Title: Multi-lingual Optical Character Recognition Seminar
Author: Marek Rychlik
Author: Yan Han
Abstract: This seminar is devoted to current OCR research and to the development of the "Worldly OCR" software. It is open to external speakers, and we are set up for Zoom presentations; volunteers to give a presentation are welcome.
Date Range: 2019-2020
Primary URL: http://alamos.math.arizona.edu/ocr
worldly-ocr (Web Resource)
Title: worldly-ocr
Author: Marek Rychlik
Author: Sayyed Vazirizade
Author: Yan Han
Author: Dylan Murphy
Author: Dwight Nweigwe
Abstract: Data and MATLAB code for a new OCR system
Year: 2018
Primary URL: https://github.com/mrychlik/worldly-ocr
Primary URL Description: The website contains the data and MATLAB code produced by the project and will be the primary dissemination site for the software products resulting from the project.
Development of a New Image-to-text Conversion System for Pashto, Farsi and Traditional Chinese (Web Resource)
Title: Development of a New Image-to-text Conversion System for Pashto, Farsi and Traditional Chinese
Author: Marek Rychlik
Author: Dwight Nweigwe
Author: Yan Han
Author: Dylan Murphy
Abstract: We report on the results of Worldly OCR, a research and prototype-building project dedicated to developing new, more accurate image-to-text conversion software for several languages and writing systems. These include the cursive scripts Farsi and Pashto, as well as cursive Latin script. We also describe approaches geared towards Traditional Chinese, which is non-cursive but features an extremely large character set of 65,000 characters. Our methodology is based on Machine Learning, especially Deep Learning, and Data Science, and is directed towards vast quantities of original documents, exceeding a billion pages. The paper is written for a general audience with an interest in Digital Humanities or in the retrieval of accurate full text and metadata from digital images.
Year: 2020
Primary URL: https://arxiv.org/abs/2005.08650
Primary URL Description: A version of the "white paper" posted to arXiv, the widely known Cornell e-print server.
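For readers unfamiliar with how deep-learning OCR systems of this kind are commonly built, the sketch below shows a minimal CRNN-style text-line recognizer trained with CTC loss, a standard approach to cursive-script recognition. It is an editorial illustration only, written in Python/PyTorch rather than the project's MATLAB codebase; the layer sizes, names, and the choice of CTC itself are assumptions and do not describe the Worldly OCR implementation.

# Illustrative sketch only: a minimal CRNN + CTC line recognizer, a common
# deep-learning recipe for cursive-script OCR. NOT the project's code; all
# sizes and names here are assumptions.
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    def __init__(self, num_classes, img_height=32):
        super().__init__()
        # Convolutional feature extractor: reduces height, keeps width as the time axis.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # H/2, W/2
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # H/4, W/4
        )
        feat_dim = 64 * (img_height // 4)
        # Bidirectional LSTM reads the feature columns in both directions.
        self.rnn = nn.LSTM(feat_dim, 128, bidirectional=True, batch_first=True)
        # Per-time-step character scores; class 0 is reserved for the CTC blank.
        self.fc = nn.Linear(256, num_classes)

    def forward(self, images):               # images: (B, 1, H, W)
        f = self.cnn(images)                 # (B, C, H', W')
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)  # (B, W', C*H')
        out, _ = self.rnn(f)
        return self.fc(out).log_softmax(-1)  # (B, W', num_classes)

# Toy training step with CTC loss; random tensors stand in for line images and labels.
alphabet_size = 60                                   # hypothetical character inventory
model = TinyCRNN(num_classes=alphabet_size + 1)      # +1 for the blank symbol
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

images = torch.randn(4, 1, 32, 128)                  # batch of 4 text-line images
targets = torch.randint(1, alphabet_size + 1, (4, 10))
target_lengths = torch.full((4,), 10, dtype=torch.long)

log_probs = model(images)                            # (B, T, C)
input_lengths = torch.full((4,), log_probs.shape[1], dtype=torch.long)
opt.zero_grad()
loss = ctc(log_probs.permute(1, 0, 2), targets, input_lengths, target_lengths)
loss.backward()
opt.step()
print(float(loss))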
Marek Rychlik's YouTube channel (Web Resource)
Title: Marek Rychlik's YouTube channel
Author: Marek Rychlik
Abstract: The channel features approximately 20 videos generated with the software produced by the project, visualizing the algorithms it developed.
Year: 2019
Primary URL: https://www.youtube.com/channel/UCcq2ciH_Eb0rDckJmS_p-XQ
Primary URL Description: The videos are a mix of highly technical and easy-to-understand visualizations, illustrating character recognition, the method of outlines, etc.
One video, entitled "Image-to-text conversion for Farsi", illustrates the operation of a full implementation of the OCR pipeline on Farsi text (used as a proxy for Pashto, since Pashto training data was not yet available at the time of creation). The video demonstrates 97-98% character-level accuracy.
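As context for that figure, character-level accuracy of OCR output is commonly computed from the edit distance between the recognized text and a reference transcript. The short Python sketch below shows that standard calculation; it is illustrative only, and whether the project measured its 97-98% figure in exactly this way is an assumption.

# Illustrative only: character-level accuracy as 1 - (edit distance / reference length).
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def char_accuracy(reference: str, hypothesis: str) -> float:
    """Character-level accuracy of an OCR hypothesis against a reference transcript."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    cer = levenshtein(reference, hypothesis) / len(reference)
    return max(0.0, 1.0 - cer)

# Example with short Perso-Arabic-script strings (works the same for any Unicode text).
ref = "سلام دنیا"
hyp = "سلام دنيا"   # one character differs
print(round(char_accuracy(ref, hyp), 3))  # 0.889 for this 9-character reference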