Alignment-Based Training for Detecting Reader Annotations in Printed Books (Conference Paper/Presentation)
Title: Alignment-Based Training for Detecting Reader Annotations in Printed Books
Author: Soumya Mohanty
Author: David A. Smith
Abstract: Digitized books preserve not only the printed text on the page but also the marks of readers' past engagement. These reader annotations include extra text such as handwritten notes or corrections, underlines, highlights, brackets, delimiters, etc. Building a model that could automatically detect and localize these annotations can allow us to trace which passages interested readers and what additions, if any, they made to the text.
...
Date: 5/8/2019
Primary URL: http://datech.digitisation.eu/programme/schedule/
Primary URL Description: conference website
Conference Name: Digital Access to Textual Cultural Heritage (DATeCH)
Detecting de minimis Code-Switching in Historical German Books (Conference Paper/Presentation)
Title: Detecting de minimis Code-Switching in Historical German Books
Author: Shijia Liu
Author: David A. Smith
Abstract: Code-switching has long interested linguists, with computational work in particular focusing on speech and social media data (Sitaram et al., 2019). This paper contrasts these informal instances of code-switching to its appearance in more formal registers, by examining the mixture of languages in the Deutsches Textarchiv (DTA), a corpus of 1406 primarily German books from the 17th to 19th centuries. We automatically annotate and manually inspect spans of six embedded languages (Latin, French, English, Italian, Spanish, and Greek) in the corpus. We quantitatively analyze the differences between code-switching patterns in these books and those in more typically studied speech and social media corpora. Furthermore, we address the practical task of predicting code-switching from features of the matrix language alone in the DTA corpus. Such classifiers can help reduce errors when optical character recognition or speech transcription is applied to a large corpus with rare embedded languages.
Date: 12/8/2020
Primary URL: https://www.aclweb.org/anthology/2020.coling-main.163/
Primary URL Description: Archival version of the paper in the ACL Anthology.
Conference Name: International Conference on Computational Linguistics (COLING)
Content-based models of quotation (Conference Paper/Presentation)
Title: Content-based models of quotation
Author: Ansel MacLaughlin
Author: David A. Smith
Abstract: We explore the task of quotability identification, in which, given a document, we aim to identify which of its passages are the most quotable, i.e. the most likely to be directly quoted by later derived documents. We approach quotability identification as a passage ranking problem and evaluate how well both feature-based and BERT-based (Devlin et al., 2019) models rank the passages in a given document by their predicted quotability. We explore this problem through evaluations on five datasets that span multiple languages (English, Latin) and genres of literature (e.g. poetry, plays, novels) and whose corresponding derived documents are of multiple types (news, journal articles). Our experiments confirm the relatively strong performance of BERT-based models on this task, with the best model, a RoBERTa sequential sentence tagger, achieving an average ρ of 0.35 and NDCG@1, 5, 50 of 0.26, 0.31 and 0.40, respectively, across all five datasets.
Date: 4/19/2021
Primary URL: http://ccs.neu.edu/home/dasmith/maclaughlin-eacl-2021.pdf
Primary URL Description: Preprint
Conference Name: Conference of the European Chapter of the Association for Computational Linguistics (EACL)
Digital Editions as Distant Supervision for Layout Analysis of Printed Books (Conference Paper/Presentation)
Title: Digital Editions as Distant Supervision for Layout Analysis of Printed Books
Author: Alejandro H. Toselli
Author: Si Wu
Author: David A. Smith
Abstract: Archivists, textual scholars, and historians often produce digital editions of historical documents. Using markup schemes such as those of the Text Encoding Initiative and EpiDoc, these digital editions often record documents’ semantic regions (such as notes and figures) and physical features (such as page and line breaks) as well as transcribing their textual content. We describe methods for exploiting this semantic markup as distant supervision for training and evaluating layout analysis models. In experiments with several model architectures on the half-million pages of the Deutsches Textarchiv (DTA), we find a high correlation of these region-level evaluation methods with pixel-level and word-level metrics. We discuss the possibilities for improving accuracy with self-training and the ability of models trained on the DTA to generalize to other historical printed books.
Date: 9/8/2021
Conference Name: International Conference on Document Analysis and Recognition
Permalink: https://apps.neh.gov/publicquery/products.aspx?gn=HAA-263837-19