Program

Digital Humanities: Digital Humanities Advancement Grants

Period of Performance

9/1/2018 - 8/31/2022

Funding Totals

$323,767.00 (approved)
$323,767.00 (awarded)


A Linked Digital Environment for Coptic Studies

FAIN: HAA-261271-18

Georgetown University (Washington, DC 20057-0001)
Amir Zeldes (Project Director: January 2018 to present)
Caroline T. Schroeder (Co Project Director: May 2018 to present)

The creation and expansion of a suite of language processing tools to better analyze documents written in Coptic – the language of first millennium Egypt – and other ancient Near Eastern languages.

Building on our previous work in Natural Language Processing for Coptic, we will capitalize on recent advances in Digital Humanities & Computational Linguistics to strengthen tools & data available for Coptic. Specifically, we will harness Deep Learning methods to handle a variety of source materials, including OCR data & editions with varying orthography, enhance materials via Linked Open Data and automatic Named Entity Recognition, & integrate automatic syntactic analyses into our materials.





Associated Products

Building Linguistically and Intertextually-Tagged Coptic Corpora with Open Source Tools (Conference Paper/Presentation)
Title: Building Linguistically and Intertextually-Tagged Coptic Corpora with Open Source Tools
Author: Miyagawa, So; Zeldes, Amir; Büchler, Marco; Behlmer, Heike; and Griffitts, Troy
Abstract: Coptic is the last stage of the Egyptian language. Before Coptic, Ancient Egyptian was written in Hieroglyphs, Hieratic, and Demotic scripts. Starting in the third century CE (excluding “Old Coptic”), Coptic used an alphabet based on the Greek and several added Demotic letters. A large but understudied corpus of literary texts exists in Coptic, including important Gnostic, monastic and Manichaean texts, as well as early Biblical translations. Efforts to build a digital Coptic corpus are still in their initial phases. In this paper, we present the most recent work in a partnership of Digital Humanities projects. Coptic SCRIPTORIUM (Schroeder and Zeldes, 2016) is a major initiative endeavoring to put corpora online which are linguistically and philologically annotated (i.e. supporting grammatical, paleographical and literary annotations), while projects in Göttingen are producing digital editions of Coptic texts focusing on philological standards and critical editions: A project at the Göttingen Academy of Sciences and Humanities is preparing a complete digital edition of the Coptic Old Testament (Behlmer and Feder, 2017), and in a project of Collaborative Research Centre 1136 “Education and Religion” digital diplomatic editions of selected works of Shenoute and Besa, 4th-5th century abbots of the White Monastery in Upper Egypt, are being prepared for text reuse research. Based on our experiences, we have schematized workflows for building Coptic corpora with linguistic and literary information by using open source programs, merging data from OCR (Optical Character Recognition) and transcription sources, Natural Language Processing (NLP) tools, and manual annotation interfaces allowing for the correction of automatic tool output.
Date: 9/11/2018
Conference Name: Proceedings of JADH2018

Understanding Space and Place through Digital Text Analysis (Conference Paper/Presentation)
Title: Understanding Space and Place through Digital Text Analysis
Author: Schroeder, Caroline T.
Abstract: Understanding Space and Place through Digital Text Analysis
Date: 2/25/2019
Conference Name: Third PAThs International Conference: Coptic Literature in Context. The Contexts of Coptic Literature: Late Antique Egypt in a dialogue between literature, archaeology and digital humanities. Sapienza University, Rome

A Characterwise Windowed Approach to Hebrew Morphological Segmentation (Conference Paper/Presentation)
Title: A Characterwise Windowed Approach to Hebrew Morphological Segmentation
Author: Zeldes, Amir
Abstract: This paper presents a novel approach to the segmentation of orthographic word forms in contemporary Hebrew, focusing purely on splitting without carrying out morphological analysis or disambiguation. Casting the analysis task as character-wise binary classification and using adjacent character and word-based lexicon-lookup features, this approach achieves over 98% accuracy on the benchmark SPMRL shared task data for Hebrew, and 97% accuracy on a new out-of-domain Wikipedia dataset, an improvement of ~4% and 5% over previous state-of-the-art performance.
Date: 10/31/2018
Primary URL: https://aclweb.org/anthology/W18-5811
Primary URL Description: Full paper
Conference Name: 15th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology at EMNLP 2018, Brussels, Belgium

The Coptic Universal Dependency Treebank (Conference Paper/Presentation)
Title: The Coptic Universal Dependency Treebank
Author: Zeldes, Amir and Abrams, Mitchell
Abstract: This paper presents the Coptic Universal Dependency Treebank, the first dependency treebank within the Egyptian subfamily of the Afro-Asiatic languages. We discuss the composition of the corpus, challenges in adapting the UD annotation scheme to existing conventions for annotating Coptic, and evaluate inter-annotator agreement on UD annotation for the language. Some specific constructions are taken as a starting point for discussing several more general UD annotation guidelines, in particular for appositions, ambiguous passivization, incorporation and object-doubling.
Date: 11/1/2018
Primary URL Description: Full paper
Secondary URL: https://aclweb.org/anthology/W18-6022
Conference Name: Proceedings of the Universal Dependencies Workshop 2018

Building a Collaborative Environment for Digital Coptic Studies (Conference Paper/Presentation)
Title: Building a Collaborative Environment for Digital Coptic Studies
Author: Caroline T. Schroeder
Author: Amir Zeldes
Abstract: Small fields of research, such as Coptic Studies, bring challenges: Few departments have specialists. Resources with established funding structures are rare. Individuals produce valuable text editions, but in print or heterogeneous digital editions. Complex morphology (e.g., in Coptic, word forms containing verbs and objects) makes quantitative work and search difficult. Additionally, due to colonial history, many Coptic texts are unpublished, fragmentary, or dismembered (in different libraries around the globe). Coptic Scriptorium brings cultural heritage resources online in machine readable and open formats to address these challenges. This paper demonstrates how the project’s collaborative annotation tools and natural language processing tools leverage interdisciplinary methods to produce open corpora for research and cultural heritage preservation.
Date: 04/15/2019
Conference Name: Workshop on Digital Humanities to Preserve Knowledge and Cultural Heritage, Stanford University Center for Spatial and Textual Analysis

Coptic Scriptorium. Lecture for Sunoikisis Digital Humanities Course (Conference Paper/Presentation)
Title: Coptic Scriptorium. Lecture for Sunoikisis Digital Humanities Course
Author: Caroline T. Schroeder
Abstract: Coptic Scriptorium is an interdisciplinary project dedicated to the digital and computational study of Coptic Language and Literature. The last phase of the ancient Egyptian language, Coptic was prominent in Egypt during the Roman and early Byzantine periods and is important for research in Religious Studies, Linguistics, Biblical Studies, Papyrology, Classics, Egyptology, and other fields. This presentation will introduce the project and goals, demonstrate the tools and technology available for researchers, provide an overview of key aspects "under the hood" that go into making the project, and assess challenges in this kind of work. We will end with a summary of future directions in the project and an invitation for researchers to get involved.
Date: 07/11/2019
Primary URL: https://github.com/SunoikisisDC/SunoikisisDC-2018-2019/wiki/Summer2019-Session15
Conference Name: Sunoikisis Digital Humanities Course

The Making of Coptic Wordnet (Conference Paper/Presentation)
Title: The Making of Coptic Wordnet
Author: Laura Slaughter
Author: Luis Morgado Da Costa
Author: So Miyagawa
Author: Marco Büchler
Author: Amir Zeldes
Author: Heike Behlmer
Abstract: With the increasing availability of wordnets for ancient languages, such as Ancient Greek and Latin, gaps remain in the coverage of less studied languages of antiquity. This paper reports on the construction and evaluation of a new wordnet for Coptic, the language of Late Roman, Byzantine and Early Islamic Egypt in the first millennium CE. We present our approach to constructing the wordnet which uses multilingual Coptic dictionaries and wordnets for five different languages. We further discuss the results of this effort and outline our on-going/future work.
Date: 07/23/2019
Conference Name: Global Wordnet Conference (GWC 2019)

Computational Tools and the Cross-Cultural Study of Literature (Public Lecture or Presentation)
Title: Computational Tools and the Cross-Cultural Study of Literature
Abstract: In this plenary panel, four Digital Humanities researchers from different disciplines—English, Classics, Religious Studies, East Asian Studies— will discuss the importance of DH research for literary studies, including the role of DH literary studies in public and social engagement. Each panelist will deliver brief remarks followed by a lengthy question-and-answer session.
Author: Caroline T. Schroeder
Date: 04/25/2019
Location: Dartmouth College, Quantitative Criticism Lab DH Conference

Coptic Studies in a Digital Age: Building Communities and Challenging Orthodoxies (Conference Paper/Presentation)
Title: Coptic Studies in a Digital Age: Building Communities and Challenging Orthodoxies
Author: Caroline T. Schroeder
Abstract: Building a truly interdisciplinary and collaborative digital research project poses a number of challenges. This paper will use the creation and growth of the Coptic Scriptorium project to address the challenges of building interdisciplinary and international collaborations. It will also outline ways in which successful interdisciplinary projects sometimes need to creatively adapt, ignore, or even subvert existing standards and orthodoxies in digital studies.
Date: 04/25/2019
Conference Name: Quantitative Criticism Lab DH Conference, Dartmouth

Ancient Languages in a Digital Age: Building an Online Environment for Coptic Studies (Conference Paper/Presentation)
Title: Ancient Languages in a Digital Age: Building an Online Environment for Coptic Studies
Author: Caroline T. Schroeder
Abstract: Creating Coptic Scriptorium addressed a need in the study of the ancient and medieval Mediterranean world: an open access digital environment for the study of the language and literature of Roman and Byzantine Egypt. This lecture addresses how Coptic Scriptorium leveraged interdisciplinary methods from Linguistics, History, Religious Studies, and Classics to build on previous traditional and digital scholarship to create something new. It outlines challenges to work in digital Coptic as well as avenues for future research, especially through collaboration with other digital research groups.
Date: 04/04/2019
Conference Name: Ancient History and Mediterranean Archaeology Colloquium, University of California at Berkeley

Coptic Scriptorium V3.0.0 (Database/Archive/Digital Edition)
Title: Coptic Scriptorium V3.0.0
Author: Caroline T. Schroeder
Author: Amir Zeldes
Abstract: Coptic Scriptorium is happy to announce our latest data release, including a variety of new sources thanks to our collaborators (digitized data courtesy of the Marcion and PAThs projects!). New in this release are: Saints' lives (Life of Cyrus; Life of Onnophrius; Lives of Longinus and Lucius; Martyrdom of Victor the General, part 2) and Miscellaneous (Dormition of John; Homilies of Proclus; Letter of Pseudo-Ephrem). We are also releasing expansions to some of our existing corpora, including: Canons of Johannes (new material annotated by Elizabeth Platte and Caroline T. Schroeder, digital edition provided by Diliana Atanassova) and Apophthegmata Patrum. There are also a large number of corrections to most of our existing corpora, which are being republished in this release. All documents have metadata for word segmentation, tagging, and parsing to indicate whether those annotations are machine annotations only (automatic), checked for accuracy by an expert in Coptic (checked), or closely reviewed for accuracy, usually as a result of manual parsing (gold). Also new in this release are stable identifiers and links to PAThs identifiers (https://atlas.paths-erc.eu/cite); specifically, some corpora now contain metadata identifiers for PAThs manuscripts, PAThs works, and PAThs authors.
Year: 2019
Primary URL: https://github.com/CopticScriptorium/corpora/
Primary URL Description: Version controlled downloadable archives
Secondary URL: http://data.copticscriptorium.org
Secondary URL Description: Browsable web interface
Access Model: Open Access (Creative Commons)

Coptic NLP V3.0.0 (Computer Program)
Title: Coptic NLP V3.0.0
Author: Amir Zeldes
Abstract: Coptic NLP Pipeline: an end-to-end NLP pipeline for Coptic text in UTF-8 encoding. Online production version available as a web interface at: https://corpling.uis.georgetown.edu/coptic-nlp/
Year: 2019
Primary URL: https://github.com/CopticScriptorium/coptic-nlp
Primary URL Description: Source code
Secondary URL: https://corpling.uis.georgetown.edu/coptic-nlp/
Secondary URL Description: Demo
Access Model: Open Source (Apache License)
Programming Language/Platform: Python
Source Available?: Yes

RFTokenizer V1.0.1 (Computer Program)
Title: RFTokenizer V1.0.1
Author: Amir Zeldes
Abstract: RFTokenizer: a character-wise tokenizer for morphologically rich languages
Year: 2019
Primary URL: https://github.com/amir-zeldes/RFTokenizer
Access Model: Open Source (Apache License)
Programming Language/Platform: Python
Source Available?: Yes

Digital Approaches to Studying Authorial Style and Monastic Subjectivity in Early Christian Egypt (Article)
Title: Digital Approaches to Studying Authorial Style and Monastic Subjectivity in Early Christian Egypt
Author: Rebecca Krawiec
Author: Caroline T. Schroeder
Abstract: This chapter explains the major tools Coptic Scriptorium provides and explores their implications for the study of religion, particularly in late antique Egypt. The corpora present a range of monastic and biblical texts, richly annotated for multiple forms of search and analysis. The annotation allows scholars not only to search efficiently for particular words, or particular parts of speech, but also to locate them within the larger literary structures of the particular text. Such analysis allows for traditional philological approaches in a digital platform but, more importantly, has the potential to enhance our understanding of the various “writerly subjectivities” of authors. It will also help us investigate the development of a particular monastic vocabulary. Coptic Scriptorium provides the tools to determine grammatical patterns and variance that in turn will lead to richer understanding of rhetorical choices and styles used to construct the monastic and religious landscape of early Christian Egypt.
Year: 2021
Primary URL: https://www.degruyter.com/document/doi/10.1515/9783110573022-005/html
Primary URL Description: Publisher website
Format: Other
Periodical Title: Digital Humanities Research Methods in Religious Studies
Publisher: De Gruyter

Cultural Heritage Preservation and Canon Formation: What Syriac and Coptic Can Teach Us about the Historiography of the Digital Humanities (Book Section)
Title: Cultural Heritage Preservation and Canon Formation: What Syriac and Coptic Can Teach Us about the Historiography of the Digital Humanities
Author: Caroline T. Schroeder
Editor: Georgia Frank
Editor: Susan Holman
Editor: Andrew Jacobs
Abstract: This paper uses the historiography of Syriac and Coptic Studies to reconsider and recast the origin story of a rapidly rising field in the academy, namely the Digital Humanities. Origin stories about the Digital Humanities (or “humanities computing” as it was known in its early years) often begin with Fr. Roberto Busa, a theologian and Aquinas scholar who teamed up with IBM’s Thomas J. Watson to produce the computing infrastructure behind the print publication of the Index Thomisticus. The Index Thomisticus can also be understood as an act of cultural heritage preservation and promotion; this early humanities computing was in the service of studying and eventually digitizing a pillar of the Western canon. The paper advocates broadening our horizons for identifying “origin moments” or “origin points” in the field of Digital Humanities, particularly as DH intersects with religion, Christianity, and Christian canons. Cultural heritage groups outside the Catholic and Protestant American mainstream conducted early digitization of Christian literary sources and scholarship. The work of Coptic Orthodox Christians and Syrian Christians to digitize Coptic and Syriac literature, promote open access scholarship, publish on digital platforms, and engage in “public humanities” (all central elements of what we now understand “digital humanities” to be) were, like Busa’s Index Thomisticus, acts of cultural heritage preservation and acts of digital humanities pioneering.
Year: 2019
Primary URL: https://www.fordhampress.com/9780823287024/the-garb-of-being/
Publisher: Fordham University Press
ISBN: 0823287041

The Digital Futures of Coptic Texts and the Coptic Scriptorium Project (Conference Paper/Presentation)
Title: The Digital Futures of Coptic Texts and the Coptic Scriptorium Project
Author: Caroline T. Schroeder
Abstract: This paper presents the latest developments in the Coptic Scriptorium project as they pertain to understanding archaeological artifacts, manuscripts, and digital objects. Part of a Workshop on “The Digital Futures of Ancient Objects: Discussing Next Steps for Collaborative Digital Humanities Projects”
Date: 01/03/2020

Ancient Egyptian Text as Data: Curating 'Small Data' in the Era of Analytics (Public Lecture or Presentation)
Title: Ancient Egyptian Text as Data: Curating 'Small Data' in the Era of Analytics
Abstract: In the era of Data Analytics and “Big Data”, what role does the study of underresourced language data play? This paper outlines data curation issues for a smaller language data set from the Egyptian language during the Roman era. Some of the principles developed while curating new “small data” resources can, and arguably should, be translated to “big data”, as well.
Author: Caroline T. Schroeder
Date: 02/03/2020
Location: College of Arts and Sciences Data Science Colloquium, University of Oklahoma

A Collaborative Ecosystem for Digital Coptic Studies (Article)
Title: A Collaborative Ecosystem for Digital Coptic Studies
Author: Caroline T. Schroeder
Author: Amir Zeldes
Abstract: Scholarship on underresourced languages brings with it a variety of challenges which make access to the full spectrum of source materials and their evaluation difficult. For Coptic in particular, large scale analyses and any kind of quantitative work become difficult due to the fragmentation of manuscripts, the highly fusional nature of an incorporational morphology, and the complications of dealing with influences from Hellenistic era Greek, among other concerns. Many of these challenges, however, can be addressed using Digital Humanities tools and standards. In this paper, we outline some of the latest developments in Coptic Scriptorium, a DH project dedicated to bringing Coptic resources online in uniform, machine readable, and openly available formats. Collaborative web-based tools create online 'virtual departments' in which scholars dispersed sparsely across the globe can collaborate, and natural language processing tools counterbalance the scarcity of trained editors by enabling machine processing of Coptic text to produce searchable, annotated corpora.
Year: 2020
Primary URL: https://jdmdh.episciences.org/6797
Primary URL Description: Paper
Access Model: open access
Format: Journal
Periodical Title: Journal of Data Mining and Digital Humanities
Publisher: CNRS

Exposing Coptic entities: automation, search and visualization (Conference Paper/Presentation)
Title: Exposing Coptic entities: automation, search and visualization
Author: Amir Zeldes
Author: Caroline T. Schroeder
Author: Lance Martin
Abstract: Entity recognition is a gateway technology for semantic access to ancient materials in the Digital Humanities: it allows users to discover resources about people and places of interest which they could not possibly read exhaustively, it facilitates linking between resources and projects, and it provides a window into what a text discusses, even for datasets for which translations are not available. In this paper we explore entity recognition for Coptic, the language of Hellenistic era Egypt in the first millennium CE. We evaluate a number of standard Natural Language Processing approaches to the problem and lay out the difficulties in applying them to a low-resource, morphologically complex language. We present our own solutions for named and non-named nested entity recognition and entity linking, relying on a robust parsing strategy, feature-based CRF models, and some hand-crafted knowledge base resources, which enable high accuracy entity recognition with orders of magnitude less data than commonly used for NER. The success of the approach suggests avenues for research on other languages in similar settings.
Date: 07/12/2020
Primary URL: http://kellia.uni-goettingen.de/digitalcoptic3/slides/DC3_entities_2020.pdf
Conference Name: Digital Coptic 3

An Overview of the Coptic Wordnet Project (Conference Paper/Presentation)
Title: An Overview of the Coptic Wordnet Project
Author: Laura Slaughter
Author: So Miyagawa
Author: Luis Morgado da Costa
Author: Amir Zeldes
Author: Heike Behlmer
Author: Hugo Lundhaug
Abstract: This paper reports on the process of constructing a Wordnet for the Coptic language. We will present our work on constructing the Coptic Wordnet and outline the goals for this on-going project, as well as an evaluation of its current coverage and future perspectives.
Date: 07/13/2020
Primary URL: http://kellia.uni-goettingen.de/digitalcoptic3/slides/Coptic%20WN%20-%20July%202020.pdf
Primary URL Description: Presentation
Conference Name: Digital Coptic 3

Understanding Space and Place through Digital Text Analysis (Article)
Title: Understanding Space and Place through Digital Text Analysis
Author: Caroline T. Schroeder
Abstract: Digital text analysis can identify named geographic entities within Egypt; entity recognition technology can also identify unnamed entities and abstractions. With a digitised, annotated corpus we can also research how Coptic literature talks about spaces and place. This paper will introduce the Coptic SCRIPTORIUM research platform, examine case studies in geospatial research using Coptic SCRIPTORIUM’s digital resources, and explore future areas of scholarship. Throughout, we will pay close attention to the new research questions digital technology enables as well as to the challenges for such research. Digital and computational methods enhance traditional textual scholarship and enable new modes of inquiry for understanding spaces and places in the Coptic literary world.
Year: 2020
Primary URL: https://www.torrossa.com/en/resources/an/4661856
Access Model: open access
Format: Other
Periodical Title: Proceedings of the Third PAThs Conference
Publisher: Edizioni Quasar

Exhaustive Entity Recognition for Coptic: Challenges And Solutions (Conference Paper/Presentation)
Title: Exhaustive Entity Recognition for Coptic: Challenges And Solutions
Author: Amir Zeldes
Author: Lance Martin
Author: Sichang Tu
Abstract: Entity recognition provides semantic access to ancient materials in the Digital Humanities: it exposes people and places of interest in texts that cannot be read exhaustively, facilitates linking resources and can provide a window into text contents, even for texts with no translations. In this paper we present entity recognition for Coptic, the language of Hellenistic era Egypt. We evaluate NLP approaches to the task and lay out difficulties in applying them to a low-resource, morphologically complex language. We present solutions for named and non-named nested entity recognition and semi-automatic entity linking to Wikipedia, relying on robust dependency parsing, feature-based CRF models, and hand-crafted knowledge base resources, enabling high accuracy NER with orders of magnitude less data than those used for high resource languages. The results suggest avenues for research on other languages in similar settings.
Date: 12/12/2020
Primary URL: https://www.aclweb.org/anthology/2020.latechclfl-1.3/
Primary URL Description: ACL Anthology record
Secondary URL: https://www.aclweb.org/anthology/2020.latechclfl-1.3.pdf
Secondary URL Description: Full paper PDF
Conference Name: Proceedings of the SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2020)

Linking Entity Annotation (Conference Paper/Presentation)
Title: Linking Entity Annotation
Author: Lance Martin
Abstract: Coptic Scriptorium links named entity data to their corresponding Wikipedia pages, a process commonly known as wikification. Wikification provides a common table of reference for many other projects studying the ancient world. As such, it facilitates cooperation with adjacent projects, including Pleiades, Syriaca.org, and Trismegistos.
Date: 11/30/2020
Primary URL: https://issuu.com/schoolofadvancedstudy/docs/_lp6-posters
Primary URL Description: Poster gallery
Secondary URL: https://ics.sas.ac.uk/events/linked-pasts-6
Secondary URL Description: Conference website
Conference Name: Linked Pasts 6 (London, England / online)

Coptic Scriptorium – Guidelines Overview (Course or Curricular Material)
Title: Coptic Scriptorium – Guidelines Overview
Author: Amir Zeldes
Author: Lance Martin
Abstract: This document presents a condensed summary of Coptic Scriptorium guidelines on handling diplomatic transcription, word segmentation, part of speech tagging, lemmatization and more.
Year: 2021
Primary URL: https://copticscriptorium.org/download/scriptorium_guidelines_overview.pdf
Primary URL Description: Technical Report, Georgetown University
Audience: Other

Coptic Scriptorium - Entity Annotation Guidelines (Course or Curricular Material)
Title: Coptic Scriptorium - Entity Annotation Guidelines
Author: Amir Zeldes
Author: Lance Martin
Abstract: Entity annotation concerns the annotation of referring expressions in a text, i.e. spans of text that refer to things in the world, and their classification into entity types. The purpose of entity annotation in Coptic Scriptorium is to facilitate searches which include specific entity types (e.g. finding a certain epithet using linguistic annotations, such as ⲟⲩⲁⲁⲃ ‘holy’, but only when applied to a PERSON), to inventorize entities (find all cases of e.g. places mentioned in the Apophthegmata Patrum), and to function as a gateway for entity linking, enabling searches for specific persons (“John the Baptist”), regardless of the exact expression used to mention them. The latter task of entity linking is left outside of the scope of the current guidelines.
Year: 2020
Primary URL: https://github.com/CopticScriptorium/entity-tagging/raw/master/coptic_scriptorium_entity_guidelines.pdf
Primary URL Description: Technical Report, Georgetown University
Audience: Other

Digitally Linking People and Places in Coptic Literature (Conference Paper/Presentation)
Title: Digitally Linking People and Places in Coptic Literature
Author: Lance Martin
Author: Caroline T. Schroeder
Author: Amir Zeldes
Abstract: The Coptic Scriptorium project (https://copticscriptorium.org) uses interdisciplinary digital and computational methods to make richly annotated Coptic texts openly available online. In this paper, we present the project’s latest work on linked open data for named entities in Coptic literature. Linked open data is a feature enabled by the open web, where publicly available data in one place (such as Coptic Scriptorium’s website) links to stable information related to that data elsewhere on the public internet. Our recent work has focused on identifying named entities (people, places and more) mentioned in Coptic texts and linking them to other resources online. This paper presents the newest developments of our entity coverage, progress in “wikification” (linking to Wikipedia entries) for our Sahidic New Testament and Sahidic Old Testament data, and new applications and research directions for researchers.
Date: 7/16/2021
Primary URL: http://www.stshenouda.org/22conferencereg2021
Primary URL Description: Conference website
Conference Name: 22nd St. Shenouda-UCLA Conference of Coptic Studies

Leveraging non-named entities in Coptic antiquity (Public Lecture or Presentation)
Title: Leveraging non-named entities in Coptic antiquity
Abstract: In this paper we present the latest work on large scale, semi-automatic and quantitative analysis of the body of entities mentioned in texts from Coptic Antiquity. Unlike Greek and Roman materials, which have been studied extensively, digital treatment of Coptic data from the first millennium has lagged behind until recently, in part due to the smaller research community and the morphological complexity of the language, which is fusional and features agglutination, compounding and incorporation of nouns into complex verbs. We will show how annotating named and non-named entities enriches Coptic corpora, including the identification of nested entities and entity linking. We will focus especially on non-named Coptic entities, which are of great interest to scholars working on monasticism and asceticism, since many central texts revolve around unnamed protagonists (e.g. ‘an ascetic’) in unidentified locations (‘a monastery’). In many cases, the proportion of named entities is well below 5%. The relevance of entity annotation is further demonstrated through visualizations of Coptic entities, which enable researchers to access a variety of information in Coptic corpora through distant reading, allowing them to easily explore types of places in different works, to get lists of nouns referring to organizations, events or animals, to examine feminine vs. masculine nouns denoting people, and more. From a comparative quantitative perspective, the prevalence of non-named entities of different types can also reveal dissimilarities between texts. Figure 2 juxtaposes the ranked ratio of mentions for people vs. places, illustrating which works focus more on human behavior than on descriptions of environmental settings.
Author: Amir Zeldes
Author: Caroline T. Schroeder
Author: Lance Martin
Date: 9/10/2021
Location: The Digital Classicist Seminar, London/online
Primary URL: https://www.digitalclassicist.org/
Primary URL Description: Seminar Series website.

Coptic Scriptorium V4.2.0 (Database/Archive/Digital Edition)
Title: Coptic Scriptorium V4.2.0
Author: Caroline T. Schroeder
Author: Amir Zeldes
Author: Lance Martin
Abstract: This is the public repository for Coptic SCRIPTORIUM corpora. The documents are available in multiple formats: CoNLL-U, relANNIS, PAULA XML, TEI XML, and TreeTagger SGML (*.tt). The *.tt files generally contain the most complete representations of document annotations, though note that corpus level metadata is only included in the PAULA XML and relANNIS versions. Corpora can be searched, viewed, and queried with complex queries at http://data.copticscriptorium.org. The project homepage is http://copticscriptorium.org
Year: 2021
Primary URL: https://github.com/CopticScriptorium/corpora
Primary URL Description: Online version controlled repository
Access Model: Open Access, Creative Commons license

UD Treebanking for Coptic DH. Low Resource NLP Technologies for NER, Lexicography and Linked Open Data (Conference Paper/Presentation)
Title: UD Treebanking for Coptic DH. Low Resource NLP Technologies for NER, Lexicography and Linked Open Data
Author: Amir Zeldes
Abstract: The Universal Dependencies project, which provides morphosyntactically analyzed data in over 100 languages, offers homogeneous annotation schemes and workflows for both Big Data languages such as English, and Low Resource languages often at the heart of Digital Humanities work. In this talk I will present work on a language from the latter group: Coptic, the language of 1st millennium Egypt. Thanks to progress in NLP technologies and the development of UD annotated data, our project, Coptic Scriptorium (https://copticscriptorium.org/) has been able to create fully automatic tools for analyzing Coptic data, including morphological analysis, part-of-speech tagging, lemmatization, parsing and entity recognition. These analyses feed a suite of tools enabling Named Entity Linking to open data such as Wikipedia, as well as automatic generation of lexicographic examples and entity-type based Word Sense Disambiguation in an online dictionary. This work shows that a variety of technologies often assumed to be relevant mainly for Big Data languages, such as Deep Learning, Transformers (BERT) and more, can work well when even modest amounts of richly annotated UD data are available for bootstrapping.
Date: 10/2/2021
Primary URL: https://www.kyoto-u-digitization.org/oct-2-digital-corpus-universal-dependencies-east-asian-and-coptic
Primary URL Description: Conference website
Conference Name: Digital Transformation in the Humanities, KUDH International Conference