Program

Digital Humanities: Digging into Data

Period of Performance

1/1/2010 - 3/31/2011

Funding Totals

$100,000.00 (approved)
$100,000.00 (awarded)

Towards Dynamic Variorum Editions

FAIN: HJ-50013-10

Tufts University (Somerville, MA 02144-2401)
Gregory R. Crane (Project Director: July 2009 to September 2011)

This project supports the creation of a framework to produce "dynamic variorum" editions of classical texts that enable the reader to automatically link not only to variant editions but also to relevant citations, quotations, people, and places found in a digital library of more than one million primary and secondary source texts. The project team includes members from Tufts University; the University of Massachusetts, Amherst; Imperial College, London; and Mount Allison University.

Building upon collaborations between computer scientists and classicists across three countries, we propose to build a framework that combines emerging technologies and large collections to provide for every surviving Greek and Latin author scalable, sustainable information that can exceed the breadth of traditional bibliographic databases for an entire field and the depth of traditional variorum editions for individual authors and works. We can furthermore identify patterns in the changing reception of and scholarship about Greco-Roman antiquity with greater power and flexibility than was feasible with traditional methods. The work proposed here will demonstrate and analyze the significance of these new methods. Our hypothesis, based on years of development with smaller collections, is that we can now see a wholly new generation of services that better address the most traditional goals of scholarship, are customizable to the needs of far broader audiences, and are much more practical to maintain over time.

Associated Products

Measuring Historical Word Sense Variation (Conference Paper/Presentation)
Title: Measuring Historical Word Sense Variation
Author: Gregory Crane
Author: David Bamman
Abstract: We describe here a method for automatically identifying word sense variation in a dated collection of historical books in a large digital library. By leveraging a small set of known translation book pairs to induce a bilingual sense inventory and labeled training data for a WSD classifier, we are able to automatically classify the Latin word senses in a 389 million word corpus and track the rise and fall of those senses over a span of two thousand years. We evaluate the performance of seven different classifiers both in a tenfold test on 83,892 words from the aligned parallel corpus and on a smaller, manually annotated sample of 525 words, measuring both the overall accuracy of each system and how well that accuracy correlates (via mean square error) to the observed historical variation.
Date: 06/01/11
Primary URL: http://dx.doi.org/10.1145/1998076.1998078
Primary URL Description: A link to the final published version in the ACM Digital Library.
Secondary URL: http://www.perseus.tufts.edu/publications/bamman-11.pdf
Secondary URL Description: A link to a preprint of this paper on the Perseus Digital Library.
Conference Name: ACM/IEEE Joint Conference on Digital Libraries (JCDL 2011)

Extracting Two Thousand Years of Latin from a Million Book Library (Article)
Title: Extracting Two Thousand Years of Latin from a Million Book Library
Author: David Bamman
Author: David Smith
Abstract: With the rise of large open digitization projects such as the Internet Archive and Google Books, we are witnessing an explosive growth in the number of source texts becoming available to researchers in historical languages. The Internet Archive alone contains over 27,014 texts catalogued as Latin, including classical prose and poetry written under the Roman Empire, ecclesiastical treatises from the Middle Ages, and dissertations from 19th-century Germany written - in Latin - on the philosophy of Hegel. At one billion words, this collection eclipses the extant corpus of Classical Latin by several orders of magnitude. In addition, the much larger collection of books in English, German, French, and other languages already scanned contains unknown numbers of translations for many Latin books, or parts of books. The sheer scale of this collection offers a broad vista of new research questions, and we focus here on both the opportunities and challenges of computing over such a large space of heterogeneous texts. The works in this massive collection do not constitute a neatly curated (much less balanced) corpus of Latin; it is, instead, simply all the Latin that can be extracted, and in its reach of twenty-one centuries (from ca. 200 BCE to 1922 CE) arguably spans the greatest historical distance of any major textual collection today. While we might hope that the size and historical reach of this collection can eventually offer insight into grand questions such as the evolution of a language over both time and space, we must contend as well with the noise inherent in a corpus that has been assembled with minimal human intervention.
Year: 2012
Primary URL: http://nlp.perseus.tufts.edu/docs/etc/jocch.pdf
Primary URL Description: Preprint of article under review.
Format: Journal
Periodical Title: Journal of Computing and Cultural Heritage