Program

Digital Humanities: Digital Humanities Advancement Grants

Period of Performance

9/1/2020 - 8/31/2025

Funding Totals

$324,874.00 (approved)
$292,054.00 (awarded)


Multilingual BookNLP: Building a Literary NLP Pipeline Across Languages

FAIN: HAA-271654-20

University of California, Berkeley (Berkeley, CA 94704-5940)
David Bamman (Project Director: January 2020 to present)

The expansion of the BookNLP platform for studying the linguistic structure of textual materials to allow for the analysis of resources in Spanish, Japanese, Russian and German.

BookNLP (Bamman et al., 2014) is a natural language processing pipeline for reasoning about the linguistic structure of the text of books, designed specifically for works of fiction. In addition to its pipeline of part-of-speech tagging, named entity recognition, and coreference resolution, BookNLP identifies the characters in a literary text and represents them through the actions they participate in, the objects they possess, their attributes, and their dialogue. The availability of this tool has driven much work in the computational humanities, especially work on character (Underwood et al., 2018; Kraicer and Piper, 2018; Dubnicek et al., 2018). BookNLP has one major limitation, however: it currently supports only texts written in English. The goal of this project is to develop a version of BookNLP that supports literature in Spanish, Japanese, Russian and German, and to create a blueprint that others can use to extend it to additional languages in the future.





Associated Products

BookNLP (Computer Program)
Title: BookNLP
Author: David Bamman
Abstract: BookNLP, a natural language processing pipeline for books
Year: 2021
Primary URL: https://github.com/booknlp/booknlp
Primary URL Description: GitHub repository for BookNLP
Access Model: Open access
Programming Language/Platform: Python
Source Available?: Yes

Narrative Theory for Computational Narrative Understanding (Article)
Title: Narrative Theory for Computational Narrative Understanding
Author: Andrew Piper
Author: Richard Jean So
Author: David Bamman
Abstract: Over the past decade, the field of natural language processing has developed a wide array of computational methods for reasoning about narrative, including summarization, commonsense inference, and event detection. While this work has brought an important empirical lens for examining narrative, it is by and large divorced from the large body of theoretical work on narrative within the humanities, social and cognitive sciences. In this position paper, we introduce the dominant theoretical frameworks to the NLP community, situate current research in NLP within distinct narratological traditions, and argue that linking computational work in NLP to theory opens up a range of new empirical questions that would both help advance our understanding of narrative and open up new practical applications.
Year: 2021
Primary URL: https://aclanthology.org/2021.emnlp-main.26.pdf
Primary URL Description: ACL Anthology
Access Model: Open access
Format: Journal
Periodical Title: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Publisher: ACL

Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4 (Article)
Title: Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4
Author: Kent Chang
Author: Mackenzie Cramer
Author: Sandeep Soni
Author: David Bamman
Abstract: In this work, we carry out a data archaeology to infer books that are known to ChatGPT and GPT-4 using a name cloze membership inference query. We find that OpenAI models have memorized a wide collection of copyrighted materials, and that the degree of memorization is tied to the frequency with which passages of those books appear on the web. The ability of these models to memorize an unknown set of books complicates assessments of measurement validity for cultural analytics by contaminating test data; we show that models perform much better on memorized books than on non-memorized books for downstream tasks. We argue that this supports a case for open models whose training data is known.
Year: 2023
Primary URL: https://aclanthology.org/2023.emnlp-main.453/
Access Model: Open access
Format: Other
Publisher: ACL

Grounding Characters and Places in Narrative Text (Article)
Title: Grounding Characters and Places in Narrative Text
Author: Sandeep Soni
Author: Amanpreet Sihra
Author: Elizabeth Evans
Author: Matthew Wilkens
Author: David Bamman
Abstract: Tracking characters and locations throughout a story can help improve the understanding of its plot structure. Prior research has analyzed characters and locations from text independently without grounding characters to their locations in narrative time. Here, we address this gap by proposing a new spatial relationship categorization task. The objective of the task is to assign a spatial relationship category for every character and location co-mention within a window of text, taking into consideration linguistic context, narrative tense, and temporal scope. To this end, we annotate spatial relationships in approximately 2500 book excerpts and train a model using contextual embeddings as features to predict these relationships. When applied to a set of books, this model allows us to test several hypotheses on mobility and domestic space, revealing that protagonists are more mobile than non-central characters and that women as characters tend to occupy more interior space than men. Overall, our work is the first step towards joint modeling and analysis of characters and places in narrative text.
Year: 2023
Primary URL: https://aclanthology.org/2023.acl-long.655/
Access Model: Open access
Format: Other
Publisher: ACL