New Languages for NLP: Building Linguistic Diversity in the Digital Humanities
FAIN: HT-272570-20
Princeton University (Princeton, NJ 08540-5228)
Natalia Ermolaev (Project Director: March 2020 to present)
Andrew Janco (Co Project Director: July 2020 to present)
An institute to help humanities scholars learn how to create linguistic data and apply statistical models to new languages.
Natural Language Processing (NLP) has revolutionized our ability to interpret texts at scale and is an essential tool for scholars in the digital humanities. However, only a small percentage of the world’s languages are supported by the major NLP libraries. The New Languages for NLP Institute will help scholars with expertise in less-resourced languages create linguistic data and train statistical language models for their languages. The work will take place in three workshops, held at the Center for Digital Humanities at Princeton University in 2021-2022, where participants will also learn best practices in project and research data management. As an outcome of the project, participants will publish an open dataset in the standard Conference on Computational Natural Language Learning (CoNLL) format as well as a trained language model that can be used for computational text analysis.
Associated Products
New Languages for NLP Course Materials (Course or Curricular Material)
Title: New Languages for NLP Course Materials
Author: Andrew Janco
Author: Natalia Ermolaev
Author: Toma Tasovac
Author: David Lassner
Author: Quinn Dombrowski
Author: Anubhav Sharma
Abstract: This site provides an open reference resource for participants during the workshops and acts as the first draft of materials for the online course. The course materials site has sections that present prerequisite skills and knowledge, and an entry for each workshop session with supporting information and instructions. The overall goal of the course materials site is to provide an ongoing reference work to support participants’ work and asynchronous learning.
Year: 2021
Primary URL: https://new-languages-for-nlp.github.io/course-materials/intro.html
Primary URL Description: This is the URL for the course materials.
Audience: Graduate
New Languages for NLP project website (Web Resource)
Title: New Languages for NLP project website
Author: Andrew Janco
Author: Natalia Ermolaev
Abstract: The project website serves as the public-facing informational source for the project. It articulates our aims and goals, as well as the significance of the project. One page describes our languages, team members, and research goals. The full schedules for our workshops are posted publicly here.
Year: 2021
Primary URL: https://newnlp.princeton.edu/
Primary URL Description: Project website URL
Cadet (Computer Program)
Title: Cadet
Author: Andrew Janco
Abstract: Cadet is an open-source Python web application that was created in 2021 by Andrew Janco to facilitate participants’ work and will be shared with the general public following the grant. The application facilitates the customization of language defaults for tokenization and lookups data. Cadet also uses token frequency to bulk annotate frequent unambiguous terms and to shorten the time needed for annotation.
Year: 2021
Primary URL: https://github.com/New-Languages-for-NLP/cadet
Primary URL Description: Source code for Cadet
Access Model: Open-source
Programming Language/Platform: Python
Source Available?: Yes
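Illustrative note: the Cadet abstract above mentions using token frequency to bulk annotate frequent, unambiguous terms. The following is a minimal sketch of that general idea only, not Cadet's actual code. The function name `bulk_annotate`, its parameters, and the `min_count` threshold are all hypothetical and chosen for illustration.

```python
from collections import Counter, defaultdict


def bulk_annotate(tokens, seed_annotations, min_count=2):
    """Propose one label per token type that is both frequent and unambiguous.

    Hypothetical helper, not part of Cadet.
    tokens: list of token strings from the corpus.
    seed_annotations: list of (token, label) pairs from a small hand-annotated sample.
    min_count: assumed frequency threshold for treating a token as "frequent".
    """
    freq = Counter(tokens)

    # Collect every label each token type has received in the seed sample.
    labels_seen = defaultdict(set)
    for token, label in seed_annotations:
        labels_seen[token].add(label)

    proposals = {}
    for token, count in freq.most_common():
        labels = labels_seen.get(token, set())
        # Keep only tokens that are frequent AND have exactly one observed label.
        if count >= min_count and len(labels) == 1:
            proposals[token] = next(iter(labels))
    return proposals


corpus = ["the", "cat", "the", "run", "run", "dog"]
seed = [("the", "DET"), ("run", "VERB"), ("run", "NOUN"), ("dog", "NOUN")]
proposals = bulk_annotate(corpus, seed)
```

In this toy run only "the" is proposed: "run" has two competing labels in the seed sample, "dog" is below the frequency threshold, and "cat" has no seed label. An annotator would still review the proposals before accepting them.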
Eisenstein (Computer Program)
Title: Eisenstein
Author: Andrew Janco
Abstract: “Eisenstein” is an open-source Python web application that was built in 2021 by Andrew Janco for participants who needed optical character recognition using Tesseract. This web application simplifies Tesseract text extraction in over one hundred languages.
Year: 2021
Primary URL: https://eisenstein.apjan.co/
Primary URL Description: User-facing website
Secondary URL: https://github.com/apjanco/eisenstein
Secondary URL Description: Source code
Access Model: Open-source
Programming Language/Platform: Python
Source Available?: Yes