Johannes Heinecke


Knowledge Extraction From Texts Based on Wikidata
Anastasia Shimorina | Johannes Heinecke | Frédéric Herledan
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track

This paper presents an effort within our company to develop a knowledge extraction pipeline for English, which can be further used for constructing an enterprise-specific knowledge base. We present a system consisting of entity detection and linking, coreference resolution, and relation extraction based on the Wikidata schema. We highlight existing challenges of knowledge extraction by evaluating the deployed pipeline on real-world data. We also make available a database, which can serve as a new resource for sentential relation extraction, and we underline the importance of having balanced data for training classification models.

Étiquetage ou génération de séquences pour la compréhension automatique du langage en contexte d’interaction? (Sequence tagging or sequence generation for Natural Language Understanding ?)
Rim Abrougui | Géraldine Damnati | Johannes Heinecke | Frédéric Béchet
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Natural Language Understanding (NLU) in interactive contexts is often reduced to detecting intents and concepts on single-domain corpora annotated with a single intent per utterance. To move beyond this paradigm, we address more complex annotation schemes, aiming at structured semantic representations that go beyond the simple intent/concept model. We focus on the MultiWOZ corpus, which is commonly used for dialogue state tracking. We examine how these complex semantic annotations can be projected for NLU, comparing several sequence tagging approaches and then proposing a new formalism inspired by graph generation methods for AMR semantic modelling. Finally, we discuss the potential of generative approaches.

Transfer Learning and Masked Generation for Answer Verbalization
Sebastien Montella | Lina Rojas-Barahona | Frederic Bechet | Johannes Heinecke | Alexis Nasr
Proceedings of the Workshop on Structured and Unstructured Knowledge Integration (SUKI)

Structured Knowledge has recently emerged as an essential component to support fine-grained Question Answering (QA). In general, QA systems query a Knowledge Base (KB) to detect and extract the raw answers as final prediction. However, since such raw answers lack context, language generation can offer a much more informative and complete response. In this paper, we propose to combine the power of transfer learning and the advantage of entity placeholders to produce high-quality verbalization of extracted answers from a KB. We claim that such an approach is especially well-suited for answer generation. Our experiments show 44.25%, 3.26% and 29.10% relative gain in BLEU over the state-of-the-art on the VQuAnDA, ParaQA and VANiLLa datasets, respectively. We additionally provide minor hallucination corrections in VANiLLa, affecting 5% of each of the training and test sets. We witness a median absolute gain of 0.81 SacreBLEU. This underlines the importance of data quality when using automated evaluation.

Multilingual Abstract Meaning Representation for Celtic Languages
Johannes Heinecke | Anastasia Shimorina
Proceedings of the 4th Celtic Language Technology Workshop within LREC2022

Deep Semantic Parsing into Abstract Meaning Representation (AMR) graphs has reached a high quality with neural-based seq2seq approaches. However, the training corpus for AMR is only available for English. Several approaches to process other languages exist, but only for high-resource languages. We present an approach to create a multilingual text-to-AMR model for three Celtic languages, Welsh (P-Celtic) and the closely related Irish and Scottish-Gaelic (Q-Celtic). The success of this approach is mainly due to the underlying multilingual transformers such as mT5. We finally show that machine-translated test corpora unfairly improve the AMR evaluation by about 1 or 2 points (depending on the language).


Hyperbolic Temporal Knowledge Graph Embeddings with Relational and Time Curvatures
Sebastien Montella | Lina M. Rojas Barahona | Johannes Heinecke
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021


Transformer based Natural Language Generation for Question-Answering
Imen Akermi | Johannes Heinecke | Frédéric Herledan
Proceedings of the 13th International Conference on Natural Language Generation

This paper explores Natural Language Generation within the context of the Question-Answering task. Previous works addressing this task have focused only on generating a short answer or a long text span that contains the answer, while reasoning over a Web page or processing structured data. The length of such answers is usually not appropriate, as they tend to be perceived as too brief or too long to be read aloud by an intelligent assistant. In this work, we aim at generating a concise answer for a given question using an unsupervised approach that does not require annotated data. Tested on English and French datasets, the proposed approach shows very promising results.

Approche de génération de réponse à base de transformers (Transformer based approach for answer generation)
Imen Akermi | Johannes Heinecke | Frédéric Herledan
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles

This article presents an unsupervised approach based on Transformer models for natural language generation in question-answering systems. This approach would address the problem of generating answers that are too short or too long, without resorting to annotated data. It shows promising results for English and French.

Hybrid Enhanced Universal Dependencies Parsing
Johannes Heinecke
Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies

This paper describes our system to predict enhanced dependencies for Universal Dependencies (UD) treebanks, which ranked 2nd in the Shared Task on Enhanced Dependency Parsing with an average ELAS of 82.60%. Our system uses a hybrid two-step approach. First, we use a graph-based parser to extract a basic syntactic dependency tree. Then, we use a set of linguistic rules which generate the enhanced dependencies for the syntactic tree. The application of these rules is optimized using a classifier which predicts their suitability in the given context. A key advantage of this approach is its language independence, as rules rely solely on dependency trees and UPOS tags which are shared across all languages.

Cross-lingual and Cross-domain Evaluation of Machine Reading Comprehension with Squad and CALOR-Quest Corpora
Delphine Charlet | Geraldine Damnati | Frederic Bechet | Gabriel Marzinotto | Johannes Heinecke
Proceedings of the Twelfth Language Resources and Evaluation Conference

Machine Reading has recently received a lot of attention thanks to both the availability of very large corpora such as SQuAD or MS MARCO containing triplets (document, question, answer), and the introduction of Transformer Language Models such as BERT which obtain excellent results, even matching human performance according to the SQuAD leaderboard. One of the key features of Transformer Models is their ability to be jointly trained across multiple languages, using a shared subword vocabulary, leading to the construction of cross-lingual lexical representations. This feature has been used recently to perform zero-shot cross-lingual experiments where a multilingual BERT model fine-tuned on a machine reading comprehension task exclusively for English was directly applied to Chinese and French documents with interesting performance. In this paper we study the cross-language and cross-domain capabilities of BERT on a Machine Reading Comprehension task on two corpora: SQuAD and a new French Machine Reading dataset, called CALOR-QUEST. The semantic annotation available on CALOR-QUEST allows us to give a detailed analysis of the kinds of questions that are properly handled through the cross-language process. We try to answer this question: which of the two factors, language mismatch or domain mismatch, has the strongest influence on the performance of a Machine Reading Comprehension task?

Denoising Pre-Training and Data Augmentation Strategies for Enhanced RDF Verbalization with Transformers
Sebastien Montella | Betty Fabre | Tanguy Urvoy | Johannes Heinecke | Lina Rojas-Barahona
Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+)

The task of verbalization of RDF triples has grown in popularity due to the rising ubiquity of Knowledge Bases (KBs). The formalism of RDF triples is a simple and efficient way to store facts at a large scale. However, its abstract representation makes it difficult for humans to interpret. For this purpose, the WebNLG challenge aims at promoting automated RDF-to-text generation. We propose to leverage pre-training on augmented data with the Transformer model, using a data augmentation strategy. Our experimental results show minimum relative increases of 3.73%, 126.05% and 88.16% in BLEU score for seen categories, unseen entities and unseen categories respectively over the standard training.


CALOR-QUEST : un corpus d’entraînement et d’évaluation pour la compréhension automatique de textes (CALOR-QUEST: a training and evaluation corpus for machine reading comprehension)
Frederic Bechet | Cindy Aloui | Delphine Charlet | Geraldine Damnati | Johannes Heinecke | Alexis Nasr | Frederic Herledan
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume II : Articles courts

Machine reading comprehension is a task belonging to the family of Question-Answering systems, where questions are not generic in scope but are related to a particular document. Recently, very large corpora (SQuAD, MS MARCO) containing triplets (document, question, answer) have been made available to the scientific community in order to develop supervised methods based on deep neural networks, with promising results. However, these methods are very hungry for training data, which currently exists only for English. The aim of this study is to enable the development of such resources for other languages at lower cost, by proposing a method that semi-automatically generates questions from a semantic analysis of a large corpus. The collection of natural questions is reduced to a validation/test set. Applying this method to the CALOR-Frame corpus made it possible to develop the CALOR-QUEST resource presented in this article.

Spoken Conversational Search for General Knowledge
Lina M. Rojas Barahona | Pascal Bellec | Benoit Besset | Martinho Dossantos | Johannes Heinecke | Munshi Asadullah | Olivier Leblouch | Jeanyves. Lancien | Geraldine Damnati | Emmanuel Mory | Frederic Herledan
Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue

We present a spoken conversational question answering proof of concept that is able to answer questions about general knowledge from Wikidata. The dialogue agent not only orchestrates various agents but also resolves coreferences and ellipses.

Development of a Universal Dependencies treebank for Welsh
Johannes Heinecke | Francis M. Tyers
Proceedings of the Celtic Language Technology Workshop

ConlluEditor: a fully graphical editor for Universal dependencies treebank files
Johannes Heinecke
Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019)

MaskParse@Deskin at SemEval-2019 Task 1: Cross-lingual UCCA Semantic Parsing using Recursive Masked Sequence Tagging
Gabriel Marzinotto | Johannes Heinecke | Géraldine Damnati
Proceedings of the 13th International Workshop on Semantic Evaluation

This paper describes our recursive system for SemEval-2019 Task 1: Cross-lingual Semantic Parsing with UCCA. Each recursive step consists of two parts. We first perform semantic parsing using a sequence tagger to estimate the probabilities of the UCCA categories in the sentence. Then, we apply a decoding policy which interprets these probabilities and builds the graph nodes. Parsing is done recursively: we perform a first inference on the sentence to extract the main scenes and links, and then we recursively apply our model on the sentence using a masking feature that reflects the decisions made in previous steps. The process continues until the terminal nodes are reached. We chose a standard neural tagger and we focus on our recursive parsing strategy and on the cross-lingual transfer problem to develop a robust model for the French language, using only a few training samples.

CALOR-QUEST : generating a training corpus for Machine Reading Comprehension models from shallow semantic annotations
Frederic Bechet | Cindy Aloui | Delphine Charlet | Geraldine Damnati | Johannes Heinecke | Alexis Nasr | Frederic Herledan
Proceedings of the 2nd Workshop on Machine Reading for Question Answering

Machine reading comprehension is a task related to Question-Answering where questions are not generic in scope but are related to a particular document. Recently very large corpora (SQuAD, MS MARCO) containing triplets (document, question, answer) were made available to the scientific community to develop supervised methods based on deep neural networks with promising results. These methods need very large training corpora to be efficient; however, such data only exists for English and Chinese at the moment. The aim of this study is the development of such resources for other languages by proposing to generate questions semi-automatically from the semantic Frame analysis of large corpora. The collection of natural questions is reduced to a validation/test set. We applied this method to the CALOR-Frame French corpus to develop the CALOR-QUEST resource presented in this paper.


Handling Normalization Issues for Part-of-Speech Tagging of Online Conversational Text
Géraldine Damnati | Jeremy Auguste | Alexis Nasr | Delphine Charlet | Johannes Heinecke | Frédéric Béchet
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)


Multi-Model and Crosslingual Dependency Analysis
Johannes Heinecke | Munshi Asadullah
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

This paper describes the system of the Team Orange-Deskiñ, used for the CoNLL 2017 UD Shared Task in Multilingual Dependency Parsing. We based our approach on an existing open source tool (BistParser), which we modified in order to produce the required output. Additionally we added a kind of pseudo-projectivisation. This was needed since some of the task’s languages have a high percentage of non-projective dependency trees. In most cases we also employed word embeddings. For the 4 surprise languages, the data provided seemed too little to train on. Thus we decided to use the training data of typologically close languages instead. Our system achieved a macro-averaged LAS of 68.61% (10th in the overall ranking) which improved to 69.38% after bug fixes.


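Discourse Representation Theory et graphes sémantiques : formalisation sémantique en contexte industriel (Discourse Representation Theory and semantic graphs: semantic formalisation in an industrial context)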
Maxime Amblard | Johannes Heinecke | Estelle Maillebuau
Actes de la 15ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

This work presents an extension of the formal semantic representations of the Orange Labs natural language processing tool. We address here only questions relating to the construction of semantic representations, in the context of linguistic analysis. In order to obtain finer-grained representations of the argument structure of utterances, we incorporate concepts from DRT into the representation system based on semantic graphs, so as to account for the notion of scope.


Génération automatique des représentations ontologiques (Automatic generation of ontological representations)
Johannes Heinecke
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

Since the conception of the Semantic Web, an important task has arisen for natural language processing: making the existing content of the so-called classic Web accessible to ontological processing and reasoning. As most of this content consists of text, ontological representations of this textual information need to be generated. In our article we propose a method to automate this translation using ontologies and a deep syntactic-semantic analysis.


Eliminative Parsing with Graded Constraints
Johannes Heinecke | Jurgen Kunze | Wolfgang Menzel | Ingo Schroder
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1

Eliminative Parsing with Graded Constraints
Johannes Heinecke | Jurgen Kunze | Wolfgang Menzel | Ingo Schroder
COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics