Iana Atanassova


Logical Layout Analysis Applied to Historical Newspapers
Nicolas Gutehrlé | Iana Atanassova
Proceedings of the Workshop on Natural Language Processing for Digital Humanities

In recent years, libraries and archives have led important digitisation campaigns that have opened access to vast collections of historical documents. While such documents are often available as XML ALTO documents, they lack information about their logical structure. In this paper, we address the problem of logical layout analysis applied to historical documents. We propose a method based on the study of a dataset to identify rules that assign logical labels to both blocks and lines of text from XML ALTO documents. Our dataset contains newspapers in French, published in the first half of the 20th century. The evaluation shows that our methodology performs well for the identification of first lines of paragraphs and text lines, with F1 above 0.9. The identification of titles obtains an F1 of 0.64. This method can be applied to preprocess XML ALTO documents in preparation for downstream tasks, and also to annotate large-scale datasets to train machine learning and deep learning algorithms.


Multiple In-text Reference Aggregation Phenomenon
Marc Bertin | Iana Atanassova
Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL)


Extraction of Author’s Definitions Using Indexed Reference Identification
Marc Bertin | Iana Atanassova | Jean-Pierre Descles
Proceedings of the 1st Workshop on Definition Extraction