Program

Digital Humanities: Digital Humanities Advancement Grants

Period of Performance

9/1/2020 - 8/31/2021

Funding Totals

$100,000.00 (approved)
$85,808.00 (awarded)


A Knowledge Graph for Managing and Analyzing Spanish American Notary Records

FAIN: HAA-271747-20

University of Missouri, Kansas City (Kansas City, MO 64110-2235)
Viviana L Grieco (Project Director: January 2020 to present)
Praveen Rao (Co Project Director: May 2020 to present)

The development of methods to make it easier for scholars to research historical records, with a focus on 17th century notary records from Argentina. 

We propose to develop a software tool that will enable scholars to expeditiously read and analyze seventeenth century Spanish American notary records and quickly find relevant content in these document collections. Since these records were written in a type of script that was intentionally cryptic, it takes years of training in Spanish American paleography to become proficient in reading and analyzing them. Digital collections contain large amounts of information that can be modeled as a knowledge graph by applying deep learning and knowledge management techniques. The development of such a tool will make notarial scripts accessible to a larger community of researchers without requiring extensive paleography training. By modeling the content in the notary records as a knowledge graph, graph queries will facilitate the identification of legal formulae that characterize types of notarized documents and allow researchers to more efficiently mine the information relevant to their projects.





Associated Products

A Knowledge Graph for Managing and Analyzing Spanish American Notary Records (Web Resource)
Title: A Knowledge Graph for Managing and Analyzing Spanish American Notary Records
Author: Viviana Grieco
Author: Praveen Rao
Abstract: This website summarizes the progress of our work. Its pages provide a history of the project, short biographies of the team members, and our presentations and publications. The website also provides a link to our interactive keyboard.
Year: 2021
Primary URL: https://www.umkc.edu/mide/NEH-Project/publications.asp
Primary URL Description: Using recent advances in deep learning and knowledge management, we will develop a tool to manage and analyze about 220,000 pages of digital images of seventeenth-century manuscripts available at the Archivo General de la Nación Argentina (National Archives) located in Buenos Aires. This software will enable twenty-first century scholars to expeditiously read and analyze seventeenth century Spanish American notary records and efficiently find relevant content in these documentary collections.

Archivo Histórico and Deep Learning (Public Lecture or Presentation)
Title: Archivo Histórico and Deep Learning
Abstract: The PIs, GRAs, and one of the consultants provided an update on this project at an event hosted by the National Archives in Argentina.
Author: Viviana Grieco
Author: Praveen Rao
Author: Martin Wasserman
Author: Nouf Alrasheed
Author: Shivika Prasanna
Date: 05/21/2021
Location: The presentation was virtual but was recorded
Primary URL: https://www.umkc.edu/mide/NEH-Project/history.asp
Secondary URL: http://ravignani.institutos.filo.uba.ar/evento/archivo-hist%C3%B3rico-y-deep-learning-conversaci%C3%B3n-sobre-el-proyecto

Undergraduate Research Day at the Capitol (Missouri) (Public Lecture or Presentation)
Title: Undergraduate Research Day at the Capitol (Missouri)
Abstract: Undergraduate students associated with this project presented summaries of their research projects to Missouri legislators.
Author: Ryan Rowland
Author: Adam Cardenas-Sisk
Date: 04/16/2021
Location: Virtual event, Missouri - USA
Primary URL: https://event.crowdcompass.com/ugrd2021
Primary URL Description: Video recording of the presentation
Secondary URL: https://www.umkc.edu/mide/NEH-Project/history.asp

Character Recognition of Seventeenth Century Spanish American Notary Records Using Deep Learning (Article)
Title: Character Recognition of Seventeenth Century Spanish American Notary Records Using Deep Learning
Author: Nouf Alrasheed
Author: Viviana Grieco
Author: Praveen Rao
Abstract: Handwritten character recognition is a challenging pattern recognition problem due to the inconsistency of the handwritten scripts and the lack of accurate labeled data. Historical documents written in cursive are even more challenging as characters have unique and varying shapes. Frequently, words are linked by lines and ornamental doodles. When historical documents are digitized, the images contain various types of noise and degradation, which further complicates the recognition of characters. In this paper, we present an empirical study of how well state-of-the-art convolutional neural networks (CNNs) for image classification perform for the task of recognizing handwritten characters in seventeenth-century Spanish American notarial scripts. Professional historians, paleography experts and trained labelers were involved in preparing the labeled dataset of Spanish characters for training the CNNs. The labeled dataset used in this experiment was created from the manuscripts written by one of the multiple scribes that contributed to the collection of approximately 220,000 digitized images of notary records housed at the Archivo General de la Nación Argentina (National Archives). We removed the noise in these images by applying standard image processing techniques. After training different CNNs, we computed the classification accuracy for all the characters. We observed that ResNet-50 achieved a promising accuracy of 97.08% compared to InceptionResnet-V2, Inception-V3, and VGG-16, which achieved 96.66%, 96.33% and 70.91%, respectively.
Year: 2021
Primary URL: http://digitalhumanities.org/dhq/
Secondary URL: https://drive.google.com/file/d/1xD4y6otD35vkZmkYUwKPUL9Pz-C9qTLM/view
Access Model: Open Access
Format: Journal
Periodical Title: Digital Humanities Quarterly
Publisher: Digital Humanities Quarterly 15.4

Evaluation of Deep Learning Techniques for Content Extraction in Spanish Colonial Notary Records (Article)
Title: Evaluation of Deep Learning Techniques for Content Extraction in Spanish Colonial Notary Records
Author: Nouf Alrasheed
Author: Shivika Prasanna
Author: Ryan Rowland
Author: Martin Wasserman
Author: Praveen Rao
Author: Viviana Grieco
Abstract: Processing and analyzing historical manuscripts is considered one of the most challenging problems in the document analysis and recognition domain. Manuscripts written in cursive are even more difficult due to overlapping words with random spacing, irregular and varying characters’ shapes, poor scan quality, and insufficient labeled data. Despite the significant achievements of deep learning approaches in computer vision, handwritten word recognition is far from solved. Most of the existing methods focus on well-segmented word datasets. In this paper, we present an empirical study investigating how well state-of-the-art deep learning models perform on detection and recognition of handwritten words in Spanish American notary records. Professional historians were involved in preparing a labeled dataset of 26,482 Spanish words employed in the experiments. We investigate the performance of some state-of-the-art models on optical character recognition (OCR) on handwritten text documents: Keras-OCR, the object detection algorithm "You Only Look Once" (YOLO), Tesseract OCR, Kraken, and Calamari-OCR. Since YOLO does not include a text recognizer, we propose YOLO-OCR, an innovative model to detect and recognize words in historical manuscripts written in Spanish. Our results show the performance of pre-trained models on our dataset and that Keras-OCR and YOLO-OCR models are highly valuable for content extraction.
Year: 2021
Primary URL: https://drive.google.com/file/d/1aE7Idy7Il-CnxAYf8vnJZeULGe4xwIvv/view
Access Model: Open Access
Format: Journal
Periodical Title: 3rd Workshop on structuring and Understanding Multimedia Heritage Contents (SUMAC) co-Multimedia
Publisher: 3rd Workshop on structuring and Understanding Multimedia Heritage Contents (SUMAC) co-Multimedia

Interactive Keyboard (Web Resource)
Title: Interactive Keyboard
Author: Nouf Alrasheed
Author: Viviana Grieco
Author: Praveen Rao
Author: Adam Cardenas-Sisk
Author: Ryan Rowland
Author: Shivika Prasanna
Abstract: We used the characters from our dataset to create this virtual keyboard. The keyboard can be used for paleography training. The set of fonts can be downloaded and used in word processors.
Year: 2021
Primary URL: https://mu-data-science.github.io/KGSAR/

Seventeenth-century Spanish character dataset (Database/Archive/Digital Edition)
Title: Seventeenth-century Spanish character dataset
Author: Nouf Alrasheed
Author: Ryan Rowland
Author: Adam Cardenas-Sisk
Author: Victoria Dominguez
Author: David Freeman
Author: Viviana Grieco
Author: Praveen Rao
Abstract: For the recognition of characters in seventeenth century Spanish American notary scripts, we created a labeled dataset of 24,000 images, 1,000 for each character in the archaic Spanish alphabet. Of each character's 1,000 samples, 250 were manually labeled and 750 were generated. We made this dataset available for future research.
Year: 2021
Primary URL: https://github.com/UMKC-BigDataLab/DeepLearningSpanishAmerican
Access Model: Open Access

Seventeenth-century Spanish words dataset (Database/Archive/Digital Edition)
Title: Seventeenth-century Spanish words dataset
Author: Ryan Rowland
Author: Adam Cardenas-Sisk
Author: Victoria Dominguez
Author: Nouf Alrasheed
Author: Martin Wasserman
Author: Viviana Grieco
Author: Praveen Rao
Author: Shivika Prasanna
Abstract: For word recognition on seventeenth-century Spanish American notary scripts, we created a labeled dataset of 26,482 words, of which 6,401 were unique (the other words appeared more than once in the images). We made this dataset available for future research.
Year: 2021
Primary URL: https://github.com/UMKC-BigDataLab/DeepLearningSpanishAmerican
Access Model: Open Access

Evaluation of Deep Learning Techniques for Content Extraction in Spanish Colonial Notary Records (Public Lecture or Presentation)
Title: Evaluation of Deep Learning Techniques for Content Extraction in Spanish Colonial Notary Records
Abstract: Processing and analyzing historical manuscripts is considered one of the most challenging problems in the document analysis and recognition domain. Manuscripts written in cursive are even more difficult due to overlapping words with random spacing, irregular and varying characters’ shapes, poor scan quality, and insufficient labeled data. Despite the significant achievements of deep learning approaches in computer vision, handwritten word recognition is far from solved. Most of the existing methods focus on well-segmented word datasets. In this paper, we present an empirical study investigating how well state-of-the-art deep learning models perform on detection and recognition of handwritten words in Spanish American notary records. Professional historians were involved in preparing a labeled dataset of 26,482 Spanish words employed in the experiments. We investigate the performance of some state-of-the-art models on optical character recognition (OCR) in handwritten text documents: Keras-OCR, the object detection algorithm "You Only Look Once" (YOLO), Tesseract OCR, Kraken, and Calamari-OCR. Since YOLO does not include a text recognizer, we propose YOLO-OCR, an innovative model to detect and recognize words in historical manuscripts written in Spanish. Our results show the performance of pre-trained models on our dataset and that Keras-OCR and YOLO-OCR models are highly valuable for content extraction.
Author: Nouf Alrasheed
Author: Shivika Prasanna
Author: Ryan Rowland
Author: Praveen Rao
Author: Viviana Grieco
Author: Martin Wasserman
Date: 10/20/2021
Location: Chengdu, China
Primary URL: https://sumac-workshops.github.io/2021/

Knowledge Graph (Web Resource)
Title: Knowledge Graph
Author: Shivika Prasanna
Author: Nouf Alrasheed
Author: Praveen Rao
Author: Viviana Grieco
Abstract: Our team developed a knowledge graph (KG) to model the contents of the notary records. Our current KG contains 21 million statements/facts. This KG can be queried to obtain matching records. The entire software is packaged as a Docker container.
Year: 2021
Primary URL: https://github.com/MU-Data-Science/KGSAR
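
The abstract above describes a knowledge graph of statements that can be queried to obtain matching records. As a rough illustration only, the sketch below stores statements as subject-predicate-object triples and matches them against a pattern. It is not the project's implementation (see the repository for that); the identifiers `record:1`, `hasType`, `mentionsWord`, and the sample facts are hypothetical and are not drawn from the project's data.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical sample statements; NOT taken from the project's knowledge graph.
TRIPLES: List[Triple] = [
    ("record:1", "hasType", "power_of_attorney"),
    ("record:1", "mentionsWord", "poder"),
    ("record:2", "hasType", "sale"),
    ("record:2", "mentionsWord", "venta"),
    ("record:3", "mentionsWord", "poder"),
]


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return the triples that agree with every field that is not None."""
    return [
        t
        for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


def records_mentioning(triples: List[Triple], word: str) -> List[str]:
    """Return the sorted identifiers of records that mention the given word."""
    hits = match(triples, predicate="mentionsWord", obj=word)
    return sorted({subject for subject, _, _ in hits})


if __name__ == "__main__":
    print(records_mentioning(TRIPLES, "poder"))
```

A graph of millions of statements would normally live in a dedicated graph or triple store rather than an in-memory list; the list here only keeps the sketch self-contained.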

Media Coverage: Two NEH Grants to Aid High-Tech Humanities Research (Web Resource)
Title: Media Coverage: Two NEH Grants to Aid High-Tech Humanities Research
Author: UMKC Today
Abstract: PIs as well as their projects were featured in UMKC's Newsletter.
Year: 2020
Primary URL: https://www.umkc.edu/news/posts/2020/july/two-neh-grants-to-aid-high-tech-humanities-research.html

Media Coverage: MU Engineer Uses Machine Learning to Translate Historical Script (Web Resource)
Title: Media Coverage: MU Engineer Uses Machine Learning to Translate Historical Script
Author: MU College of Engineering Blog
Abstract: PIs and their projects were featured in MU's College of Engineering Blog.
Year: 2020
Primary URL: https://showme.missouri.edu/2020/mu-engineer-uses-machine-learning-to-translate-historical-script/

Conversatorio de la Red de Archivos del CONICET. Conversatorio sobre Tecnologías, Desarrollos e Interacciones Disciplinares en Torno a los Archivos (Film/TV/Video Broadcast or Recording)
Title: Conversatorio de la Red de Archivos del CONICET. Conversatorio sobre Tecnologías, Desarrollos e Interacciones Disciplinares en Torno a los Archivos
Writer: Participants
Director: Consejo Nacional de Investigaciones Científicas y Técnicas (Argentina)
Producer: CONICET
Abstract: PIs Grieco and Rao were invited to discuss their project with other research groups working on the preservation of and access to documentary collections. This was a virtual event hosted by the Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Argentina.
Year: 2021
Primary URL: https://www.youtube.com/watch?v=92ExCGLduWI
Access Model: Open Access
Format: Video