The Dynamic Lexicon: Cyberinfrastructure and the Automatic Analysis of Historical Languages
FAIN: PR-50013-08
Tufts University (Somerville, MA 02144-2401)
Gregory R. Crane (Project Director: July 2007 to April 2012)
Research on methods to generate a dynamic lexicon for a text corpus in a digital library. Using Greek and Latin texts, the project would investigate processes to enumerate possible senses for the words being defined and provide detailed syntactic information and statistical data about their use in a corpus.
We propose to research core functions for the automatic analysis of historical languages (Greek & Latin) within an emerging cyberinfrastructure; we will research three technologies for building a dynamic lexicon, as well as the processes required to automatically create such a reference work for any textual collection. Our efforts will focus on parallel text analysis, word sense induction and disambiguation, and syntactic parsing. These technologies will enable us to create a reference work that lists the possible senses for a word while also providing syntactic information and statistical data about its use in a corpus. The methods we use to create this work will let users search a text not only by word form, but also by word sense, syntactic subcategorization and selectional preference. Our main contribution will be the steps that any digital library needs to take to dynamically create a reference work of its own and interface it with the texts in its collection.
Associated Products
Building a Dynamic Lexicon from a Digital Library (Conference Paper/Presentation)
Title: Building a Dynamic Lexicon from a Digital Library
Author: Gregory Crane
Author: David Bamman
Abstract: We describe here in detail our work toward creating a dynamic
lexicon from the texts in a large digital library. By
leveraging a small structured knowledge source (a 30,457
word treebank), we are able to extract selectional preferences
for words from a 3.5 million word Latin corpus. This
is promising news for low-resource languages and digital collections
seeking to leverage a small human investment into
much larger gain. The library architecture in which this
work is developed allows us to query customized subcorpora
to report on lexical usage by author, genre or era and allows
us to continually update the lexicon as new texts are added
to the collection.
Date: 06/01/08
Primary URL: http://dx.doi.org/10.1145/1378889.1378892
Primary URL Description: A link to the final version of this paper published in the ACM Digital Library.
Secondary URL: http://hdl.handle.net/10427/42686
Secondary URL Description: A link to a preprint of this paper deposited into the Tufts Digital Library.
Conference Name: Joint Conference on Digital Libraries
Measuring Historical Word Sense Variation (Conference Paper/Presentation)
Title: Measuring Historical Word Sense Variation
Author: Gregory Crane
Author: David Bamman
Abstract: We describe here a method for automatically identifying
word sense variation in a dated collection of historical books
in a large digital library. By leveraging a small set of known
translation book pairs to induce a bilingual sense inventory
and labeled training data for a WSD classifier, we are able to
automatically classify the Latin word senses in a 389 million
word corpus and track the rise and fall of those senses over
a span of two thousand years. We evaluate the performance
of seven different classifiers both in a tenfold test on 83,892
words from the aligned parallel corpus and on a smaller,
manually annotated sample of 525 words, measuring both
the overall accuracy of each system and how well that accuracy
correlates (via mean square error) to the observed
historical variation.
Date: 06/01/11
Primary URL: http://dx.doi.org/10.1145/1998076.1998078
Primary URL Description: A link to the final published version in the ACM Digital Library database.
Secondary URL: http://www.perseus.tufts.edu/publications/bamman-11.pdf
Secondary URL Description: A link to a preprint of this paper on the Perseus Digital Library website.
Conference Name: JCDL '11 Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
The Ancient Greek and Latin Dependency Treebanks (Book Section)
Title: The Ancient Greek and Latin Dependency Treebanks
Author: Gregory Crane
Author: David Bamman
Editor: Kalliopi Zervanou
Editor: Caroline Sporleder
Editor: Antal van den Bosch
Abstract: This paper describes the development, composition, and several uses of
the Ancient Greek and Latin Dependency Treebanks, large collections of Classical
texts in which the syntactic, morphological and lexical information for each
word is made explicit. To date, over 200 individuals from around the world have
collaborated to annotate over 350,000 words, including the entirety of Homer’s Iliad
and Odyssey, Sophocles’ Ajax, all of the extant works of Hesiod and Aeschylus,
and selections from Caesar, Cicero, Jerome, Ovid, Petronius, Propertius, Sallust and
Vergil. While perhaps the most straightforward value of such an annotated corpus
for Classical philology is the morphosyntactic searching it makes possible, it also
enables a large number of downstream tasks as well, such as inducing the syntactic
behavior of lexemes and automatically identifying similar passages between texts.
Year: 2011
Primary URL: http://dx.doi.org/10.1007/978-3-642-20227-8_5
Primary URL Description: Link to final published version in SpringerLink.
Secondary URL: http://nlp.perseus.tufts.edu/docs/latech.pdf
Secondary URL Description: Link to open access version on the Perseus Digital Library website.
Access Model: Open access copy is available.
Publisher: Springer Berlin Heidelberg
Book Title: Language Technology for Cultural Heritage
ISBN: 978-3-642-2022
Transferring Structural Markup Across Translations Using Multilingual Alignment and Projection (Conference Paper/Presentation)
Title: Transferring Structural Markup Across Translations Using Multilingual Alignment and Projection
Author: David Bamman
Author: Alison Babeu
Author: Gregory Crane
Abstract: We present here a method for automatically projecting structural
information across translations, including canonical citation
structure (such as chapters and sections), speaker information,
quotations, markup for people and places, and
any other element in TEI-compliant XML that delimits spans
of text that are linguistically symmetrical in two languages.
We evaluate this technique on two datasets, one containing
perfectly transcribed texts and one containing errorful
OCR, and achieve an accuracy rate of 88.2% projecting
13,023 XML tags from source documents to their transcribed
translations, with an 83.6% accuracy rate when projecting
to texts containing uncorrected OCR. This approach has
the potential to allow a highly granular multilingual digital
library to be bootstrapped by applying the knowledge contained
in a small, heavily curated collection to a much larger
but unstructured one.
Date: 06/01/10
Primary URL: http://dx.doi.org/10.1145/1816123.1816126
Primary URL Description: A link to the final published version of this paper in the ACM Digital Library.
Secondary URL: http://hdl.handle.net/10427/70398
Secondary URL Description: A link to an archived version of this paper in the Tufts Digital Library.
Conference Name: JCDL '10 Proceedings of the 10th annual joint conference on Digital libraries
Prizes
Vannevar Bush Best Paper Award
Date: 6/1/2010
Organization: Joint Conference on Digital Libraries
An Ownership Model of Annotation: The Ancient Greek Dependency Treebank (Conference Paper/Presentation)
Title: An Ownership Model of Annotation: The Ancient Greek Dependency Treebank
Author: David Bamman
Author: Francesco Mambrini
Author: Gregory Crane
Abstract: We describe here the first release of the Ancient Greek Dependency Treebank (AGDT), a 190,903-word syntactically annotated corpus of literary texts including the works of Hesiod, Homer and Aeschylus. While the far larger works of Hesiod and Homer (142,705 words) have been annotated under a standard treebank production method of soliciting annotations from two independent reviewers and then reconciling their differences, we also put forth with Aeschylus (48,198 words) a new model of treebank production that draws on the methods of classical philology to take into account the personal
responsibility of the annotator in the publication and ownership of a
“scholarly” treebank.
Date: 11/01/2009
Primary URL: http://hdl.handle.net/10427/70399
Primary URL Description: Link to copy of paper deposited in Tufts Digital Archive.
Conference Name: Eighth International Workshop on Treebanks and Linguistic Theories Conference (TLT-8)
Extracting Two Thousand Years of Latin from a Million Book Library (Article)
Title: Extracting Two Thousand Years of Latin from a Million Book Library
Author: David Bamman
Author: David Smith
Abstract: With the rise of large open digitization projects such as the Internet Archive and Google Books,
we are witnessing an explosive growth in the number of source texts becoming available to researchers
in historical languages. The Internet Archive alone contains over 27,014 texts catalogued as Latin,
including classical prose and poetry written under the Roman Empire, ecclesiastical treatises from
the Middle Ages, and dissertations from 19th-century Germany written - in Latin - on the philosophy of Hegel.
At one billion words, this collection eclipses the extant corpus of Classical Latin by several orders
of magnitude. In addition, the much larger collection of books in English, German, French, and other
languages already scanned contains unknown numbers of translations for many Latin books, or parts of
books. The sheer scale of this collection offers a broad vista of new research questions,
and we focus here on both the opportunities and challenges of computing over such a
large space of heterogeneous texts. The works in this massive collection do not constitute a
finely curated (or much less balanced) corpus of Latin; it is, instead, simply all the Latin that
can be extracted, and in its reach of twenty-one centuries (from ca. 200 BCE to 1922 CE) arguably
spans the greatest historical distance of any major textual collection today. While we might
hope that the size and historical reach of this collection can eventually offer insight into
grand questions such as the evolution of a language over both time and space, we must contend
as well with the noise inherent in a corpus that has been assembled with minimal human intervention.
Year: 2012
Primary URL: http://nlp.perseus.tufts.edu/docs/etc/jocch.pdf
Primary URL Description: This paper is currently under review and should be published in 2012.
Format: Journal
Periodical Title: Journal of Computing and Cultural Heritage
The Dynamic Lexicon (Database/Archive/Digital Edition)
Title: The Dynamic Lexicon
Author: David Bamman
Abstract: The published form of the Dynamic Lexicon includes automatically generated lexical entries along with the underlying intermediate analysis used to generate them (including word-level alignments between source texts and their translations, and automatic morphological tagging and syntactic analysis for the Greek and Latin originals).
Year: 2011
Primary URL: http://nlp.perseus.tufts.edu/lexicon/
Primary URL Description: A description of the Dynamic Lexicon and downloads of the data.
Access Model: All data is licensed under a Creative Commons Attribution-Sharealike license.