Search Results

Grant number like: HD-50794-09

Award Number: HD-50794-09
Grant Program: Digital Humanities: Digital Humanities Start-Up Grants
Award Recipient: University of Massachusetts, Amherst
Project Title: OCRonym: Entity Extraction and Retrieval for Scanned Books
Award Period: 9/1/2009 - 8/31/2010
Approved Award Total: $50,000.00
Project personnel: James Allan, David Smith
Location: Amherst, MA 01003-9242, USA
Year: 2009
Field: Library Science
Division or Office: Digital Humanities

Development of an extraction and retrieval system for named entities (people, places, and organizations) located across a large number of documents. The system will be used to track Optical Character Recognition (OCR) error rates in an effort to improve "noisy" OCR.

In the past five years, massive book-scanning projects have produced an explosion in the number of sources for the humanities, available online to the broadest possible audiences. Transcribing page images by optical character recognition makes many searching and browsing tasks practical for scholars. But even low OCR error rates compound into a high probability of error in a given sentence, and the error rate is even higher for names. We propose to build a prototype system for information extraction and retrieval from noisy OCR. In particular, we will optimize the extraction and retrieval of names, which are highly informative features for detecting topics and events in documents. We will build statistical models of characters and words from scanned books to improve lexical coverage, and we will improve name categorization and disambiguation by linking document contexts to external sources such as Wikipedia. Our testbed comes from over one million scanned books from the Internet Archive.
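
The claim that low OCR error rates compound over a sentence can be illustrated with a short calculation. This is a minimal sketch, not the project's method: it assumes each character is misrecognized independently with the same probability, and the function name, error rates, and sentence length below are hypothetical values chosen only for illustration.

```python
def prob_span_has_error(char_error_rate: float, length: int) -> float:
    """Probability that a span of `length` characters contains at least one
    OCR error, ASSUMING independent errors at a fixed per-character rate.
    Real OCR errors are correlated, so this is only a rough illustration."""
    if not 0.0 <= char_error_rate <= 1.0:
        raise ValueError("char_error_rate must be between 0 and 1")
    if length < 0:
        raise ValueError("length must be non-negative")
    # Complement of "every character in the span is recognized correctly".
    return 1.0 - (1.0 - char_error_rate) ** length


if __name__ == "__main__":
    sentence_length = 100  # hypothetical sentence length in characters
    for rate in (0.01, 0.02, 0.05):  # hypothetical per-character error rates
        p = prob_span_has_error(rate, sentence_length)
        print(f"char error rate {rate:.0%}: "
              f"P(at least one error in sentence) = {p:.1%}")
```

Under these assumptions, a 1% character error rate already gives a 100-character sentence roughly a 63% chance of containing at least one error. The sketch does not model why names fare worse than ordinary words; that would need the kind of character and word statistics the proposal describes.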