2017
Apprendre des représentations jointes de mots et d’entités pour la désambiguïsation d’entités (Combining Word and Entity Embeddings for Entity Linking)
José G. Moreno | Romaric Besançon | Romain Beaumont | Eva D’Hondt | Anne-Laure Ligozat | Sophie Rosset | Xavier Tannier | Brigitte Grau
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 - Articles longs
Entity disambiguation (or entity linking), which consists of linking entity mentions in a text to entities in a knowledge base, is a problem that arises in, among other tasks, the automatic population of knowledge bases from text. One difficulty of this task is resolving ambiguities, since systems must choose among a large number of candidates. This article proposes a new approach based on jointly learning distributed representations of words and entities in the same space, which makes it possible to build a robust model for comparing the local context of an entity mention with the candidate entities.
Generating a Training Corpus for OCR Post-Correction Using Encoder-Decoder Model
Eva D’hondt | Cyril Grouin | Brigitte Grau
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
In this paper we present a novel approach to the automatic correction of OCR-induced orthographic errors in a given text. While current systems depend heavily on large training corpora or external information, such as domain-specific lexicons or confidence scores from the OCR process, our system only requires a small amount of (relatively) clean training data from a representative corpus to learn a character-based statistical language model using Bidirectional Long Short-Term Memory Networks (biLSTMs). We demonstrate the versatility and adaptability of our system on different text corpora with varying degrees of textual noise, including a real-life OCR corpus in the medical domain.
2016
Detection of Text Reuse in French Medical Corpora
Eva D’hondt | Cyril Grouin | Aurélie Névéol | Efstathios Stamatatos | Pierre Zweigenbaum
Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)
Electronic Health Records (EHRs) are increasingly available in modern health care institutions, either through the direct creation of electronic documents in hospitals’ health information systems or through the digitization of historical paper records. Each EHR creation method creates a need for sophisticated text reuse detection tools in order to prepare the EHR collections for efficient secondary use relying on Natural Language Processing methods. Herein, we address the detection of two types of text reuse in French EHRs: 1) the detection of updated versions of the same document and 2) the detection of document duplicates that still bear surface differences due to OCR or de-identification processing. We present a robust text reuse detection method to automatically identify redundant document pairs in two French EHR corpora that achieves an overall macro F-measure of 0.68 and 0.60, respectively, and correctly identifies all redundant document pairs of interest.
Low-resource OCR error detection and correction in French Clinical Texts
Eva D’hondt | Cyril Grouin | Brigitte Grau
Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis
2015
Redundancy in French Electronic Health Records: A preliminary study
Eva D’hondt | Xavier Tannier | Aurélie Névéol
Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis
2014
Genre classification using Balanced Winnow in the DEFT 2014 challenge
Eva D’hondt
TALN-RECITAL 2014 Workshop DEFT 2014 : DÉfi Fouille de Textes (DEFT 2014 Workshop: Text Mining Challenge)
2013
Text Representations for Patent Classification
Eva D’hondt | Suzan Verberne | Cornelis Koster | Lou Boves
Computational Linguistics, Volume 39, Issue 3 - September 2013