Understanding Genre in a Collection of a Million Volumes
FAIN: HD-51787-13
Board of Trustees of the University of Illinois (Champaign, IL 61801-3620)
William Underwood (Project Director: October 2012 to August 2016)
The continuing development of software that would allow users to classify digitized literary works by genre, including allowing for the changing definitions of genre over time.
Large digital collections offer new avenues of exploration for literary scholars. But their potential has not yet been fully realized, because we don’t have the metadata we would need to make literary arguments at scale. Subject classifications don’t reveal, for instance, whether a given volume is poetry, drama, fiction, or criticism. Working with a hand-classified collection of 4,275 English-language works, we have discovered new perspectives on the history of genre. But to flesh out those leads (and permit others to undertake similar projects) we need to move to a scale where manual classification would be impractical. We propose to develop software that can classify volumes by genre while allowing definitions of genre to change over time, and allowing works to belong to multiple genres. We will classify a million-volume collection (1800-1949), make our data, metadata, and software freely available through HathiTrust Research Center, and publish substantive literary findings.
Media Coverage
Looking at a Dataset for Distant Reading (Review)
Author(s): Forster, Chris
Date: 8/10/2015
Abstract: Shows readers how to use the data in our release, and works through basic descriptive statistics on the fiction portion of the data.
URL: http://cforster.com/2015/08/exploring-hathitrust-dataset/
Extracted Features in the Wild (Review)
Publication: HathiTrust Research Center Wiki
Date: 7/8/2016
Abstract: A round-up of several other projects relying on the dataset produced by this NEH grant.
URL: https://wiki.htrc.illinois.edu/display/COM/Extracted+Features+in+the+Wild
Associated Products
"Mapping Mutable Genres in Structurally Complex Volumes" (Article)
Title: "Mapping Mutable Genres in Structurally Complex Volumes"
Author: Underwood, Ted
Author: Black, Michael L.
Author: Auvil, Loretta
Author: Capitanu, Boris
Abstract: To mine large digital libraries in humanistically meaningful ways, we need to divide them by genre. This is a task that classification algorithms are well suited to assist, but they need adjustment to address the specific challenges of this domain. Digital libraries pose two problems of scale not usually found in the article datasets used to test these algorithms. 1) Because libraries span several centuries, the genres being identified may change gradually across the time axis. 2) Because volumes are much longer than articles, they tend to be internally heterogeneous, and the classification task also requires segmentation. We describe a multilayered solution that trains hidden Markov models to segment volumes, and uses ensembles of overlapping classifiers to address historical change. We demonstrate this on a collection of 469,200 volumes drawn from HathiTrust Digital Library.
Year: 2013
Primary URL: http://arxiv.org/abs/1309.3323
Access Model: open access
Format: Journal
Periodical Title: Proceedings of the IEEE
Page-Level Genre Metadata for English-Language Volumes in HathiTrust, 1700-1922 (Database/Archive/Digital Edition)
Title: Page-Level Genre Metadata for English-Language Volumes in HathiTrust, 1700-1922
Author: Underwood, Ted
Abstract: Page-by-page genre predictions for 854,476 English-language volumes printed between 1700 and 1922, keyed to the texts in HathiTrust Digital Library. This research was supported by the National Endowment for the Humanities and the American Council of Learned Societies.
The genre predictions were produced by an ensemble of regularized logistic classifiers, and are intended to support research that explores broad trends in literary history. Since volumes usually contain multiple genres, page-level metadata is necessary to create machine-readable collections in a particular genre.
Year: 2014
Primary URL: https://figshare.com/articles/Page_Level_Genre_Metadata_for_English_Language_Volumes_in_HathiTrust_1700_1922/1279201
Primary URL Description: The Figshare repository holds many different data files, listing volumes in literary genres and also characterizing those volumes at a page level.
Access Model: Open access.
A Dataset for Distant-Reading Literature in English. (Blog Post)
Title: A Dataset for Distant-Reading Literature in English.
Author: Underwood, Ted
Abstract: In collaboration with HathiTrust Research Center, the author presents a collection of page-level word counts for English-language volumes in poetry, drama, and fiction.
Date: 03/01/2015
Primary URL: https://tedunderwood.com/2015/08/07/a-dataset-for-distant-reading-literature-in-english-1700-1922/
Website: The Stone and the Shell
Word Frequencies in English-Language Literature (Database/Archive/Digital Edition)
Title: Word Frequencies in English-Language Literature
Author: Underwood, Ted
Author: Capitanu, Boris
Author: Organisciak, Peter
Author: Auvil, Loretta
Author: Bhattacharyya, Sayan
Author: Fallaw, Colleen
Author: Downie, J. Stephen
Abstract: Word frequencies for volumes of English-language literature, based on the metadata generated by Ted Underwood's NEH grant.
Year: 2015
Primary URL: https://analytics.hathitrust.org/genre
Access Model: Open access.
The Longue Durée of Literary Prestige (Article)
Title: The Longue Durée of Literary Prestige
Author: Underwood, Ted
Author: Sellers, Jordan
Abstract: The data used in this article was generated by my NEH grant.
A history of literary prestige needs to study both works that achieved distinction, and the mass of volumes from which they were distinguished. To understand how those patterns of preference changed across a century, we gathered two samples of English-language poetry, 1820-1919: one drawn from volumes reviewed in prominent periodicals, and one selected at random from a large digital library (where the majority of authors are relatively obscure). The stylistic differences associated with literary prominence turn out to be quite stable: a statistical model trained to distinguish reviewed from random volumes in any quarter of this century can make predictions almost as accurate about the rest of the period. The “poetic revolutions” described by many histories are not visible in this model — instead we see a steady tendency for new volumes of poetry to change by slightly exaggerating certain features that defined prestige in the recent past.
Year: 2016
Access Model: Subscription plus green open access after publication.
Format: Journal
Publisher: Modern Language Quarterly