Automatic Extraction of Article Metadata from Digitized Historical Newspapers
FAIN: HD-50099-07
Drexel University (Philadelphia, PA 19104-2875)
Robert B. Allen (Project Director: November 2006 to December 2009)
The development of a programming tool for automatically identifying, categorizing, and describing newspaper articles from digital files produced by the National Digital Newspaper Program (NDNP).
In the next few years, several hundred thousand newspaper pages will be digitized and made available online through the National Digital Newspaper Program. While the digitization process typically includes identification of the words in the text using basic optical character recognition (OCR), project awardees are not required to identify or index individual articles. Articles are the natural unit for interacting with the news. Identifying the articles can improve search accuracy and support user-friendly interaction, and it should increase the value of the material for historians, teachers of history, and members of the public who are interested in history. We will develop automated methods for such article-level processing. Specifically, we will build a set of Java programs that take the image files and the OCR files as input and identify, categorize, and extract descriptions from articles.
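As a rough illustration of what article-level processing over OCR output might look like, the following is a minimal Java sketch, not the project's actual method. It assumes the OCR file has already been parsed into text blocks in reading order, and it uses a simple type-height threshold to guess which blocks are headlines. The class names (ArticleSegmenter, TextBlock, Article), the fontHeight field, and the threshold heuristic are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class ArticleSegmenter {

    /** One OCR text block: its text and the height of its type in pixels (hypothetical fields). */
    static final class TextBlock {
        final String text;
        final int fontHeight;

        TextBlock(String text, int fontHeight) {
            this.text = text;
            this.fontHeight = fontHeight;
        }
    }

    /** An extracted article: a headline plus the body text that follows it. */
    static final class Article {
        final String headline;
        final StringBuilder body = new StringBuilder();

        Article(String headline) {
            this.headline = headline;
        }

        String bodyText() {
            return body.toString().trim();
        }
    }

    /**
     * Groups blocks (assumed to be in reading order) into articles.
     * A block whose type is at least headlineHeight pixels tall starts a new
     * article; smaller blocks are appended to the current article's body.
     * Body text that appears before any headline is dropped.
     */
    static List<Article> segment(List<TextBlock> blocks, int headlineHeight) {
        List<Article> articles = new ArrayList<>();
        Article current = null;
        for (TextBlock block : blocks) {
            if (block.fontHeight >= headlineHeight) {
                current = new Article(block.text);
                articles.add(current);
            } else if (current != null) {
                current.body.append(block.text).append(' ');
            }
        }
        return articles;
    }

    public static void main(String[] args) {
        // Made-up sample blocks standing in for parsed OCR output.
        List<TextBlock> page = List.of(
                new TextBlock("CITY COUNCIL MEETS", 40),
                new TextBlock("The council convened on Tuesday.", 12),
                new TextBlock("A new bridge was discussed.", 12),
                new TextBlock("WEATHER REPORT", 36),
                new TextBlock("Fair and warmer tomorrow.", 12));

        for (Article article : segment(page, 30)) {
            System.out.println(article.headline + ": " + article.bodyText());
        }
    }
}
```

A real system would likely combine several cues from both the image and the OCR files (column layout, rules between articles, typeface changes) and would add separate categorization and description steps; the single threshold here only shows the general shape of the input and output.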