Program

Digital Humanities: Digital Humanities Advancement Grants

Period of Performance

9/1/2020 - 8/31/2025

Funding Totals

$324,971.00 (approved)
$324,971.00 (awarded)


Computational tools for diachronic and cross-cultural study of literature: multilingual stylometry and phylogenetic profiling

FAIN: HAA-271822-20

University of Texas at Austin (Austin, TX 78712-0100)
Pramit Chaudhuri (Project Director: January 2020 to present)
Joseph Dexter (Co Project Director: July 2020 to present)

The extension of a textual analysis toolkit for stylistic and authorship studies, originally developed for Latin and ancient Greek, to include capabilities for working with Old English and Bengali resources.

This project, for which we are seeking a Level III Digital Humanities Advancement Grant, will expand a suite of tools with which traditionally trained humanists can analyze literary texts in a quantitative manner. The tools are designed with an important class of literary problems in mind, exemplified by the identification of stylistic effects and the individuating of works within generic traditions. We tackle these problems using two complementary approaches: stylometry augmented by machine learning and phylogenetic profiling. We will leverage our previous research in literary stylistics for the creation of a user-friendly multilingual stylometry toolkit and make enhancements to our existing methods for evolutionary analysis of literature, including automation of key steps. The tools will be tested on a set of problems at the intersection of literary criticism and big data across multiple languages, including Latin, ancient Greek, Old English, and Bengali.





Associated Products

Profiling of Intertextuality in Latin Literature Using Word Embeddings (Article)
Title: Profiling of Intertextuality in Latin Literature Using Word Embeddings
Author: Patrick Burns
Author: James Brofos
Author: Kyle Li
Author: Pramit Chaudhuri
Author: Joseph Dexter
Abstract: Identifying intertextual relationships between authors is of central importance to the study of literature. We report an empirical analysis of intertextuality in classical Latin literature using word embedding models. To enable quantitative evaluation of intertextual search methods, we curate a new dataset of 945 known parallels drawn from traditional scholarship on Latin epic poetry. We train an optimized word2vec model on a large corpus of lemmatized Latin, which achieves state-of-the-art performance for synonym detection and outperforms a widely used lexical method for intertextual search. We then demonstrate that training embeddings on very small corpora can capture salient aspects of literary style and apply this approach to replicate a previous intertextual study of the Roman historian Livy, which relied on hand-crafted stylometric features. Our results advance the development of core computational resources for a major premodern language and highlight a productive avenue for cross-disciplinary collaboration between the study of literature and NLP.
Year: 2021
Primary URL: https://aclanthology.org/2021.naacl-main.389/
Access Model: Open Access
Format: Journal
Periodical Title: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Publisher: Association for Computational Linguistics

Semantic Intertextual Search with Latin Word-Embedding Models (Public Lecture or Presentation)
Title: Semantic Intertextual Search with Latin Word-Embedding Models
Abstract: This paper describes optimization of a computational method for representing semantic information in Latin texts and application of the method to identifying intertextual relationships of literary significance. The distributional hypothesis in linguistics holds that the meaning of a word can be inferred from the contexts in which it is used (Firth); the development of effective methods for computing distributional representations known as word embeddings has revolutionized natural language processing research over the past decade (Mikolov et al., Devlin et al.). We optimize a word embedding model for Latin and use that model to improve existing methods for intertextual search through incorporation of semantic matching...
Author: Joseph Dexter
Author: Pramit Chaudhuri
Date: 01/10/2021
Location: 152nd Annual Meeting of the Society for Classical Studies
Primary URL: https://classicalstudies.org/annual-meeting/152/abstract/semantic-intertextual-search-latin-word-embedding-models

Senecan Trimeter and Humanist Tragedy (Article)
Title: Senecan Trimeter and Humanist Tragedy
Author: Fedchin, A.
Author: Burns, P.
Author: Chaudhuri, P.
Author: Dexter, J.
Abstract: The lack of extant contemporary comparanda obscures the workings of iambic trimeter in Senecan tragedy. This article offers a quantitative analysis of the reception of Senecan trimeter in four early works of Italian Humanist Tragedy, which illuminates the creative possibilities afforded by the basic structure of the meter and identifies specific features important to questions of style and semantics. Our analysis demonstrates, among other things, that both Seneca and the Humanist tragedians use clusters of resolution in conjunction with antilabe as a literary device to convey high emotion.
Year: 2022
Format: Journal
Periodical Title: American Journal of Philology

A Database of Intertexts in Valerius Flaccus’ Argonautica 1: A Benchmarking Resource for the Evaluation of Computational Intertextual Search of Latin Corpora (Article)
Title: A Database of Intertexts in Valerius Flaccus’ Argonautica 1: A Benchmarking Resource for the Evaluation of Computational Intertextual Search of Latin Corpora
Author: Dexter, J. P.
Author: Chaudhuri, P.
Author: Burns, P. J.
Author: Adams, E. D.
Author: Bolt, T. J.
Author: Cásarez, A.
Author: Flynt, J. H.
Author: Li, K.
Author: Patterson, J. F.
Author: Schwartz, A.
Author: Shumway, S.
Abstract: Characterization of intertextual references among authors is fundamental for the study of Latin literature. In this paper, we describe a large-scale intertextuality dataset compiled from three modern commentaries on Valerius Flaccus’ epic poem Argonautica. The dataset includes 945 references to earlier and contemporary Roman authors, as well as associated metadata required for use of multiple intertext search tools. To illustrate the dataset’s reuse potential, we perform a new benchmark analysis of Fīlum, a sequence alignment tool for intertextuality detection.
Year: 2024
Primary URL: https://doi.org/10.5334/johd.153
Access Model: Open Access
Format: Journal
Periodical Title: Journal of Open Humanities Data

Leveraging Part-of-Speech Tagging for Enhanced Stylometry of Latin Literature (Article)
Title: Leveraging Part-of-Speech Tagging for Enhanced Stylometry of Latin Literature
Author: Chen, S. L.
Author: Burns, P. J.
Author: Bolt, T. J.
Author: Chaudhuri, P.
Author: Dexter, J. P.
Abstract: In literary critical applications, stylometry can benefit from hand-curated feature sets capturing various syntactic and rhetorical functions. For premodern languages, calculation of such features is hampered by a lack of adequate computational resources for accurate part-of-speech tagging and semantic disambiguation. This paper reports an evaluation of POS-taggers for Latin and their use in augmenting a hand-curated stylometric feature set. Our experiments show that POS-augmented features not only provide more accurate counts than POS-blind features but also perform better on tasks such as genre classification. In the course of this work we introduce POS n-grams as a feature for Latin stylometry.
Year: 2024
Primary URL: https://doi.org/10.18653/v1/2024.ml4al-1.24
Access Model: Open Access
Format: Journal
Periodical Title: Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024)
Publisher: Association for Computational Linguistics

Computational differentiation of genre and speaker styles in Latin literature (Public Lecture or Presentation)
Title: Computational differentiation of genre and speaker styles in Latin literature
Abstract: This paper presents a quantitative approach to analyzing genre and speech in Latin literature. We build stylometric profiles of the canonical genres of Latin prose and verse based on the frequencies of function words, syntactic features, and other non-lexical markers, and use machine learning to situate the style of each genre within the corpus of classical Latin literature. We then extend these methods to analyze progressively finer distinctions: authors writing within the same genre, differences between subgenres, and finally the styles of speakers in individual works. The resulting profiles offer a multidimensional portrait of the stylistic tendencies typical of the major Latin genres (drama, elegy, and epic for verse, epistolography, historiography, oratory, philosophy, and technical treatise for prose), as well as detailed information about intrageneric and intra-authorial heterogeneity in style.
Author: Dexter, J. P.
Author: Chaudhuri, P.
Date: 02/09/2022
Location: University of Birmingham (UK)