Maud Ehrmann

2024

pdf abs
Post-Correction of Historical Text Transcripts with Large Language Models: An Exploratory Study
Emanuela Boros | Maud Ehrmann | Matteo Romanello | Sven Najem-Meyer | Frédéric Kaplan
Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)

The quality of automatic transcription of heritage documents, whether from printed, manuscripts or audio sources, has a decisive impact on the ability to search and process historical texts. Although significant progress has been made in text recognition (OCR, HTR, ASR), textual materials derived from library and archive collections remain largely erroneous and noisy. Effective post-transcription correction methods are therefore necessary and have been intensively researched for many years. As large language models (LLMs) have recently shown exceptional performances in a variety of text-related tasks, we investigate their ability to amend poor historical transcriptions. We evaluate fourteen foundation language models against various post-correction benchmarks comprising different languages, time periods and document types, as well as different transcription quality and origins. We compare the performance of different model sizes and different prompts of increasing complexity in zero and few-shot settings. Our evaluation shows that LLMs are anything but efficient at this task. Quantitative and qualitative analyses of results allow us to share valuable insights for future work on post-correcting historical texts with LLMs.

2020

pdf abs
Language Resources for Historical Newspapers: the Impresso Collection
Maud Ehrmann | Matteo Romanello | Simon Clematide | Phillip Benjamin Ströbel | Raphaël Barman
Proceedings of the Twelfth Language Resources and Evaluation Conference

Following decades of massive digitization, an unprecedented amount of historical document facsimiles can now be retrieved and accessed via cultural heritage online portals. If this represents a huge step forward in terms of preservation and accessibility, the next fundamental challenge– and real promise of digitization– is to exploit the contents of these digital assets, and therefore to adapt and develop appropriate language technologies to search and retrieve information from this ‘Big Data of the Past’. Yet, the application of text processing tools on historical documents in general, and historical newspapers in particular, poses new challenges, and crucially requires appropriate language resources. In this context, this paper presents a collection of historical newspaper data sets composed of text and image resources, curated and published within the context of the ‘impresso - Media Monitoring of the Past’ project. With corpora, benchmarks, semantic annotations and language models in French, German and Luxembourgish covering ca. 200 years, the objective of the impresso resource collection is to contribute to historical language resources, and thereby strengthen the robustness of approaches to non-standard inputs and foster efficient processing of historical documents.

2017

pdf
Book Review: Linked Lexical Knowledge Bases Foundations and Applications by Iryna Gurevych, Judith Eckle-er and Michael Matuschek
Maud Ehrmann
Computational Linguistics, Volume 43, Issue 2 - June 2017

2016

pdf abs
Cross-lingual Linking of Multi-word Entities and their corresponding Acronyms
Guillaume Jacquet | Maud Ehrmann | Ralf Steinberger | Jaakko Väyrynen
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper reports on an approach and experiments to automatically build a cross-lingual multi-word entity resource. Starting from a collection of millions of acronym/expansion pairs for 22 languages where expansion variants were grouped into monolingual clusters, we experiment with several aggregation strategies to link these clusters across languages. Aggregation strategies make use of string similarity distances and translation probabilities and they are based on vector space and graph representations. The accuracy of the approach is evaluated against Wikipedia’s redirection and cross-lingual linking tables. The resulting multi-word entity resource contains 64,000 multi-word entities with unique identifiers and their 600,000 multilingual lexical variants. We intend to make this new resource publicly available.

pdf abs
Named Entity Resources - Overview and Outlook
Maud Ehrmann | Damien Nouvel | Sophie Rosset
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Recognition of real-world entities is crucial for most NLP applications. Since its introduction some twenty years ago, named entity processing has undergone a significant evolution with, among others, the definition of new tasks (e.g. entity linking) and the emergence of new types of data (e.g. speech transcriptions, micro-blogging). These pose certainly new challenges which affect not only methods and algorithms but especially linguistic resources. Where do we stand with respect to named entity resources? This paper aims at providing a systematic overview of named entity resources, accounting for qualities such as multilingualism, dynamicity and interoperability, and to identify shortfalls in order to guide future developments.

2014

The Europe Media Monitor (EMM) is a fully-automatic system that analyses written online news by gathering articles in over 70 languages and by applying text analysis software for currently 21 languages, without using linguistic tools such as parsers, part-of-speech taggers or morphological analysers. In this paper, we describe the effort of adding to EMM Hungarian text mining tools for news gathering; document categorisation; named entity recognition and classification for persons, organisations and locations; name lemmatisation; quotation recognition; and cross-lingual linking of related news clusters. The major challenge of dealing with the Hungarian language is its high degree of inflection and agglutination. We present several experiments where we apply linguistically light-weight methods to deal with inflection and we propose a method to overcome the challenges. We also present detailed frequency lists of Hungarian person and location name suffixes, as found in real-life news texts. This empirical data can be used to draw further conclusions and to improve existing Named Entity Recognition software. Within EMM, the solutions described here will also be applied to other morphologically complex languages such as those of the Slavic language family. The media monitoring and analysis system EMM is freely accessible online via the web page http://emm.newsbrief.eu/overview.html.

pdf abs
Clustering of Multi-Word Named Entity variants: Multilingual Evaluation
Guillaume Jacquet | Maud Ehrmann | Ralf Steinberger
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Multi-word entities, such as organisation names, are frequently written in many different ways. We have previously automatically identified over one million acronym pairs in 22 languages, consisting of their short form (e.g. EC) and their corresponding long forms (e.g. European Commission, European Union Commission). In order to automatically group such long form variants as belonging to the same entity, we cluster them, using bottom-up hierarchical clustering and pair-wise string similarity metrics. In this paper, we address the issue of how to evaluate the named entity variant clusters automatically, with minimal human annotation effort. We present experiments that make use of Wikipedia redirection tables and we show that this method produces good results.

pdf abs
Representing Multilingual Data as Linked Data: the Case of BabelNet 2.0
Maud Ehrmann | Francesco Cecconi | Daniele Vannella | John Philip McCrae | Philipp Cimiano | Roberto Navigli
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Recent years have witnessed a surge in the amount of semantic information published on the Web. Indeed, the Web of Data, a subset of the Semantic Web, has been increasing steadily in both volume and variety, transforming the Web into a ‘global database’ in which resources are linked across sites. Linguistic fields – in a broad sense – have not been left behind, and we observe a similar trend with the growth of linguistic data collections on the so-called ‘Linguistic Linked Open Data (LLOD) cloud’. While both Semantic Web and Natural Language Processing communities can obviously take advantage of this growing and distributed linguistic knowledge base, they are today faced with a new challenge, i.e., that of facilitating multilingual access to the Web of data. In this paper we present the publication of BabelNet 2.0, a wide-coverage multilingual encyclopedic dictionary and ontology, as Linked Data. The conversion made use of lemon, a lexicon model for ontologies particularly well-suited for this enterprise. The result is an interlinked multilingual (lexical) resource which can not only be accessed on the LOD, but also be used to enrich existing datasets with linguistic information, or to support the process of mapping datasets across languages.

Dans cet article nous relatons notre participation à la campagne d’évaluation ESTER 2 (Evaluation des Systèmes de Transcription Enrichie d’Emissions Radiophoniques). Après avoir décrit les objectifs de cette campagne ainsi que ses spécificités et difficultés, nous présentons notre système d’extraction d’entités nommées en nous focalisant sur les adaptations réalisées dans le cadre de cette campagne. Nous décrivons ensuite les résultats obtenus lors de la compétition, ainsi que des résultats originaux obtenus par la suite. Nous concluons sur les leçons tirées de cette expérience.

2009

pdf abs
Proposition de caractérisation et de typage des expressions temporelles en contexte
Maud Ehrmann | Caroline Hagège
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Nous assistons actuellement en TAL à un regain d’intérêt pour le traitement de la temporalité véhiculée par les textes. Dans cet article, nous présentons une proposition de caractérisation et de typage des expressions temporelles tenant compte des travaux effectués dans ce domaine tout en cherchant à pallier les manques et incomplétudes de certains de ces travaux. Nous explicitons comment nous nous situons par rapport à l’existant et les raisons pour lesquelles parfois nous nous en démarquons. Le typage que nous définissons met en évidence de réelles différences dans l’interprétation et le mode de résolution référentielle d’expressions qui, en surface, paraissent similaires ou identiques. Nous proposons un ensemble des critères objectifs et linguistiquement motivés permettant de reconnaître, de segmenter et de typer ces expressions. Nous verrons que cela ne peut se réaliser sans considérer les procès auxquels ces expressions sont associées et un contexte parfois éloigné.

Nous présentons une expérience de fusion d’annotations d’entités nommées provenant de différents annotateurs. Ce travail a été réalisé dans le cadre du projet Infom@gic, projet visant à l’intégration et à la validation d’applications opérationnelles autour de l’ingénierie des connaissances et de l’analyse de l’information, et soutenu par le pôle de compétitivité Cap Digital « Image, MultiMédia et Vie Numérique ». Nous décrivons tout d’abord les quatre annotateurs d’entités nommées à l’origine de cette expérience. Chacun d’entre eux fournit des annotations d’entités conformes à une norme développée dans le cadre du projet Infom@gic. L’algorithme de fusion des annotations est ensuite présenté ; il permet de gérer la compatibilité entre annotations et de mettre en évidence les conflits, et ainsi de fournir des informations plus fiables. Nous concluons en présentant et interprétant les résultats de la fusion, obtenus sur un corpus de référence annoté manuellement.

pdf abs
Vers une méthodologie d’annotation des entités nommées en corpus ?
Karën Fort | Maud Ehrmann | Adeline Nazarenko
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

La tâche, aujourd’hui considérée comme fondamentale, de reconnaissance d’entités nommées, présente des difficultés spécifiques en matière d’annotation. Nous les précisons ici, en les illustrant par des expériences d’annotation manuelle dans le domaine de la microbiologie. Ces problèmes nous amènent à reposer la question fondamentale de ce que les annotateurs doivent annoter et surtout, pour quoi faire. Nous identifions pour cela les applications nécessitant l’extraction d’entités nommées et, en fonction des besoins de ces applications, nous proposons de définir sémantiquement les éléments à annoter. Nous présentons ensuite un certain nombre de recommandations méthodologiques permettant d’assurer un cadre d’annotation cohérent et évaluable.

pdf
Towards a Methodology for Named Entities Annotation
Karën Fort | Maud Ehrmann | Adeline Nazarenko
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

pdf
Résolution de métonymie des entités nommées : proposition d’une méthode hybride [Metonymy resolution for named entities: an hybrid approach]
Caroline Brun | Maud Ehrmann | Guillaume Jacquet
Traitement Automatique des Langues, Volume 50, Numéro 1 : Varia [Varia]

2008

pdf abs
Résolution de Métonymie des Entités Nommées : proposition d’une méthode hybride
Caroline Brun | Maud Ehrmann | Guillaume Jacquet
Actes de la 15ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Dans cet article, nous décrivons la méthode que nous avons développée pour la résolution de métonymie des entités nommées dans le cadre de la compétition SemEval 2007. Afin de résoudre les métonymies sur les noms de lieux et noms d’organisation, tel que requis pour cette tâche, nous avons mis au point un système hybride basé sur l’utilisation d’un analyseur syntaxique robuste combiné avec une méthode d’analyse distributionnelle. Nous décrivons cette méthode ainsi que les résultats obtenus par le système dans le cadre de la compétition SemEval 2007.