Leveraging "The Wisdom of the Crowds" for Efficient Tagging and Retrieval of documents from the Historic Newspaper Archive
FAIN: HD-51153-10
Columbia University (New York, NY 10027-7922)
Haimonti Dutta (Project Director: March 2010 to August 2013)
A study of user-generated subject tagging to improve search capabilities for large-scale digital archives of humanities materials, using the historic newspaper collections of the New York Public Library.
Computers may have defeated humans at chess and arithmetic, but there are many areas where the human mind still excels, such as visual cognition and language processing (Communications of the ACM, Vol. 52, No. 3, March 2009). If one mind is good, it has been argued that several minds are likely to outperform individuals, and even experts, at certain tasks. This project aims to leverage the wisdom of the crowds (von Ahn, 2008) to collaboratively tag historical newspaper articles in the holdings of the New York Public Library (NYPL). Patrons and scholars will be encouraged to generate custom tags for articles they read and use often; these will be integrated into a metadata library and evaluated for their contribution to improving retrieval performance. The text of the newspaper articles, along with the user-generated tags, will be subjected to statistical analysis and machine learning for automatic categorization.
Associated Products
Learning Parameters of the K-Means Algorithm from Subjective Human Annotation (Conference Paper/Presentation)
Title: Learning Parameters of the K-Means Algorithm from Subjective Human Annotation
Author: Barbara Taranto
Author: Haimonti Dutta
Author: Rebecca J. Passonneau
Author: Austin Lee
Author: Axinia Radeva
Author: Boyi Xie
Author: David Waltz
Abstract: The New York Public Library is participating in the Chronicling America initiative to develop an online searchable database of historically significant newspaper articles. Microfilm copies of the papers are scanned and high-resolution OCR software is run on them. The text from the OCR provides a wealth of data and opinion for researchers and historians. However, the categorization of articles provided by the OCR engine is rudimentary, and a large number of the articles are labeled "editorial" without further categorization. To provide a more refined grouping of articles, unsupervised machine learning algorithms (such as K-Means) are being investigated. The K-Means algorithm requires tuning of parameters such as the number of clusters and the mechanism of seeding to ensure that the search is not prone to being caught in a local minimum. We designed a pilot study to observe whether humans are adept at finding sub-categories. The subjective labels provided by humans are used as a guide to compare the performance of the automated clustering techniques. In addition, seeds provided by annotators are carefully incorporated into a semi-supervised K-Means algorithm (Seeded K-Means); empirical results indicate that this helps to improve performance and provides an intuitive sub-categorization of the articles labeled "editorial" by the OCR engine.
Date: 05/20/2011
Primary URL:
http://www1.ccls.columbia.edu/~dutta/flairs.pdf
Conference Name: The 24th International FLAIRS Conference, Special Track on Data Mining
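The seeded K-Means idea described in the abstract can be sketched as follows. This is a minimal illustration of the technique (annotator-chosen seed articles initialize the centroids instead of random points), not the study's actual implementation; the toy data and function name are invented for the example.

```python
import numpy as np

def seeded_kmeans(X, seed_groups, n_iter=100):
    """Semi-supervised (seeded) K-Means: each initial centroid is the
    mean of a group of annotator-provided seed examples, rather than a
    random starting point."""
    centroids = np.array([X[list(g)].mean(axis=0) for g in seed_groups],
                         dtype=float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid from its current members
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(len(centroids))
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged
        centroids = new_centroids
    return labels, centroids
```

With two seed groups drawn from a toy 2-D dataset (e.g. `seeded_kmeans(X, seed_groups=[[0], [2]])`), the remaining points are absorbed into the annotator-seeded clusters.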
Topic Identification from Historic Newspaper Articles of the New York Public Library: A Case Study (Conference Paper/Presentation)
Title: Topic Identification from Historic Newspaper Articles of the New York Public Library: A Case Study
Author: Barbara Taranto
Author: Austin Lee
Author: Haimonti Dutta
Author: Rebecca J. Passonneau
Author: David Waltz
Abstract: Chronicling America is an initiative of the National Endowment for the Humanities (NEH) and the Library of Congress (LC) whose goal is to develop an online, searchable database of historically significant newspapers published between 1836 and 1922. The New York Public Library (NYPL) is participating in the first phase of the National Digital Newspaper Program (NDNP). Between 2005 and 2009, it scanned 200,000 newspaper pages published between 1890 and 1920 from microfilm. The goal of this research project is to enable users of the historical archive, including scholars and other library patrons, to efficiently search for relevant items and tag articles of interest to them. Unfortunately, the current search facilities are rudimentary, and irrelevant documents are often ranked more highly than relevant ones. The newspapers are scanned on a page-by-page basis and article-level segmentation is poor or non-existent; the OCR scanning process is far from perfect, and the documents generated from it contain a large amount of garbled text. Our goal is to apply state-of-the-art text mining and machine learning algorithms to increase retrieval performance.
Date: 10/22/2010
Conference Name: 5th Annual Machine Learning Symposium, New York Academy of Science (NYAS), 2010
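A standard baseline for the retrieval problem the abstract describes is TF-IDF weighting with cosine-similarity ranking. The sketch below is a generic illustration of that baseline, not the project's actual search system; the documents and query are invented for the example.

```python
import math
from collections import Counter

def tfidf_rank(query, docs):
    """Rank documents by cosine similarity of TF-IDF vectors
    (a common baseline for full-text retrieval)."""
    tokenized = [d.lower().split() for d in docs]
    # Document frequency: number of docs containing each term
    df = Counter(t for doc in tokenized for t in set(doc))
    n = len(docs)

    def vec(tokens):
        tf = Counter(tokens)
        # Smoothed IDF so unseen query terms don't divide by zero
        return {t: tf[t] * math.log((1 + n) / (1 + df.get(t, 0)))
                for t in tf}

    def cos(a, b):
        num = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return num / (na * nb) if na and nb else 0.0

    q = vec(query.lower().split())
    scores = [cos(q, vec(doc)) for doc in tokenized]
    # Indices of docs, best match first
    return sorted(range(n), key=lambda i: -scores[i])
```

For a query like "tammany election" against a small corpus, documents sharing no query terms fall to the bottom of the ranking; OCR noise degrades this baseline, which is part of what motivates the project's use of tags and machine learning.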
A Tool for Visualizing and Reading Historic Newspaper Articles of NYPL (Computer Program)
Title: A Tool for Visualizing and Reading Historic Newspaper Articles of NYPL
Author: Austin Lee
Abstract: The historic newspapers of the NYPL have been scanned from microfilm, producing OCR text, XML metadata, and images. We developed a tool to visualize and read articles that span multiple newspaper pages, extract the text of the articles, obtain a bag-of-words representation, and use it for further sophisticated text analysis.
Year: 2010
Programming Language/Platform: Java
Source Available?: Yes
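The bag-of-words step the tool performs can be illustrated in a few lines (shown here in Python rather than the tool's Java, purely as a sketch of the representation; the sample text is invented):

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the OCR text, keep alphabetic tokens, count
    frequencies: the order-free representation used for analysis."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

bow = bag_of_words("The Mayor spoke; the mayor's speech ran long.")
# bow["the"] == 2 and bow["mayor"] == 2
```

The resulting term-frequency map discards word order but is the standard input for the clustering and retrieval experiments described in the other products.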
Using community structure detection to rank annotators when ground truth is subjective (Conference Paper/Presentation)
Title: Using community structure detection to rank annotators when ground truth is subjective
Author: Haimonti Dutta
Author: William Chan
Abstract: Learning from labels provided by multiple annotators has attracted a lot of interest in the machine learning community. With the advent of crowdsourcing, cheap, noisy labels are easy to obtain. This has raised the question of how to assess annotator quality. Prior work uses expectation maximization, Gaussian mixture models, and Bayesian inference to estimate consensus labels and score annotators by expertise; the key assumptions are that the ground truth is known and that the label categories are predefined. In applications where multiple ground truths are possible, assessing annotator quality is challenging, since the ranking of annotators depends on the choice of ground truth. This paper describes a case study in the context of annotating historic newspaper articles from the New York Public Library. The goal is to assign a fine-grained categorization to articles labeled "editorial" by the Optical Character Recognition (OCR) software. The task is subjective since predefined categories are not available. To define the ground truth, we use a Community Structure Detection (CSD) algorithm on a similarity graph formed between articles. The labels from the CSD algorithm provide the target function to be learned; annotators' labels are then viewed as related tasks that help learn this target function. The technique provides insights into how to rank annotator performance using well-known information retrieval metrics.
Date: 10/07/2012
Primary URL:
http://www.cs.cornell.edu/~damoulas/Site/HCSCS.html
Conference Name: Workshop on Human Computation for Science and Computational Sustainability, NIPS 2012
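Once the CSD labels are fixed as the reference, the annotator-ranking step the abstract outlines reduces to scoring each annotator's labeling against that reference. The pair-counting Rand index below is a simple stand-in for the information retrieval metrics the paper uses (it compares clusterings without requiring matching category names); the annotator names and data are invented for the example.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Pair-counting agreement between two labelings: the fraction of
    article pairs on which they agree about same-cluster vs
    different-cluster membership. Category names need not match."""
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        total += 1
        if (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j]):
            agree += 1
    return agree / total

def rank_annotators(csd_labels, annotator_labels):
    # Higher agreement with the CSD reference = better-ranked annotator
    scores = {name: rand_index(csd_labels, labs)
              for name, labs in annotator_labels.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

For instance, an annotator whose grouping of articles matches the CSD communities exactly scores 1.0 and is ranked above one whose grouping cuts across them.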