Emmanuel Giguet

2022

Réinterroger l’édition numérique et la consultation d’oeuvres anciennes : traçabilité, accessibilité, interprétabilité
Emmanuel Giguet | Julia Roger
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Atelier TAL et Humanités Numériques (TAL-HN)

In the field of digital humanities and the publishing of early works, the influence of the Text Encoding Initiative (TEI) has borne fruit and no longer needs to be demonstrated. The technological context is nonetheless conducive to the emergence of new modes of consultation and dissemination. We draw on the creation of a new interface for consulting the works of Descartes to address questions of traceability of operations, interoperability of NLP resources, and interpretability.

GREYC@FinTOC-2022: Handling Document Layout and Structure in Native PDF Bundle of Documents
Emmanuel Giguet | Nadine Lucas
Proceedings of the 4th Financial Narrative Processing Workshop @LREC2022

In this paper, we present our contribution to the FinTOC-2022 Shared Task “Financial Document Structure Extraction”. We participated in the three tracks dedicated to English, French and Spanish document processing. Our main contribution consists in considering a financial prospectus as a bundle of documents, i.e., a set of merged documents, each with its own layout and structure. Therefore, Document Layout and Structure Analysis (DLSA) starts with the boundary detection of each document using general layout features. Then, the process applies inside each single document, taking advantage of the local properties. DLSA is achieved by simultaneously considering text content, vectorial shapes and images embedded in the native PDF document. For the Title Detection task in English and French, we observed a significant improvement of the F-measures compared with those obtained during our previous participation.

2021

Daniel@FinTOC-2021: Taking Advantage of Images and Vectorial Shapes in Native PDF Document Analysis
Emmanuel Giguet | Gaël Lejeune
Proceedings of the 3rd Financial Narrative Processing Workshop

2020

Daniel@FinTOC’2 Shared Task: Title Detection and Structure Extraction
Emmanuel Giguet | Gaël Lejeune | Jean-Baptiste Tanguy
Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation

We present our contributions to the 2020 FinTOC Shared Tasks: Title Detection and Table of Contents Extraction. For the Structure Extraction task, we propose an approach that combines information from multiple sources: the table of contents, the wording of the document, and lexical domain knowledge. For the Title Detection task, we compare surface features to character-based features on various training configurations. We show that title detection results are very sensitive to the kind of training dataset used.

Daniel at the FinSBD-2 Task: Extracting List and Sentence Boundaries from PDF Documents, a model-driven approach to PDF document analysis
Emmanuel Giguet | Gaël Lejeune
Proceedings of the Second Workshop on Financial Technology and Natural Language Processing

This paper presents a robust system for deep syntactic parsing of unrestricted French. This system uses techniques from Part-of-Speech tagging in order to build a constituent structure and uses other techniques from dependency grammar in an original framework of memories in order to build a functional structure. The two structures are built simultaneously by two interacting processes. The processes share the same aim, that is, to recover syntactic information efficiently and reliably with no explicit expectation on text structure.

Syntactic Structures of Sentences from Large Corpora
Emmanuel Giguet | Jacques Vergne
Fifth Conference on Applied Natural Language Processing: Descriptions of System Demonstrations and Videos