Hicham El Boukkouri


2021

pdf bib
Differential Evaluation: a Qualitative Analysis of Natural Language Processing System Behavior Based Upon Data Resistance to Processing
Lucie Gianola | Hicham El Boukkouri | Cyril Grouin | Thomas Lavergne | Patrick Paroubek | Pierre Zweigenbaum
Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems

Most of the time, when dealing with a particular Natural Language Processing task, systems are compared on the basis of global statistics such as recall, precision, F1-score, etc. While such scores provide a general idea of the behavior of these systems, they ignore a key piece of information that can be useful for assessing progress and discerning remaining challenges: the relative difficulty of test instances. To address this shortcoming, we introduce the notion of differential evaluation which effectively defines a pragmatic partition of instances into gradually more difficult bins by leveraging the predictions made by a set of systems. Comparing systems along these difficulty bins enables us to produce a finer-grained analysis of their relative merits, which we illustrate on two use-cases: a comparison of systems participating in a multi-label text classification task (CLEF eHealth 2018 ICD-10 coding), and a comparison of neural models trained for biomedical entity detection (BioCreative V chemical-disease relations dataset).

2020

pdf bib
Ré-entraîner ou entraîner soi-même ? Stratégies de pré-entraînement de BERT en domaine médical (Re-train or train from scratch ? Pre-training strategies for BERT in the medical domain )
Hicham El Boukkouri
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 3 : Rencontre des Étudiants Chercheurs en Informatique pour le TAL

Les modèles BERT employés en domaine spécialisé semblent tous découler d’une stratégie assez simple : utiliser le modèle BERT originel comme initialisation puis poursuivre l’entraînement de celuici sur un corpus spécialisé. Il est clair que cette approche aboutit à des modèles plutôt performants (e.g. BioBERT (Lee et al., 2020), SciBERT (Beltagy et al., 2019), BlueBERT (Peng et al., 2019)). Cependant, il paraît raisonnable de penser qu’entraîner un modèle directement sur un corpus spécialisé, en employant un vocabulaire spécialisé, puisse aboutir à des plongements mieux adaptés au domaine et donc faire progresser les performances. Afin de tester cette hypothèse, nous entraînons des modèles BERT à partir de zéro en testant différentes configurations mêlant corpus généraux et corpus médicaux et biomédicaux. Sur la base d’évaluations menées sur quatre tâches différentes, nous constatons que le corpus de départ influence peu la performance d’un modèle BERT lorsque celui-ci est ré-entraîné sur un corpus médical.

pdf bib
CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters
Hicham El Boukkouri | Olivier Ferret | Thomas Lavergne | Hiroshi Noji | Pierre Zweigenbaum | Jun’ichi Tsujii
Proceedings of the 28th International Conference on Computational Linguistics

Due to the compelling improvements brought by BERT, many recent representation models adopted the Transformer architecture as their main building block, consequently inheriting the wordpiece tokenization system despite it not being intrinsically linked to the notion of Transformers. While this system is thought to achieve a good balance between the flexibility of characters and the efficiency of full words, using predefined wordpiece vocabularies from the general domain is not always suitable, especially when building models for specialized domains (e.g., the medical domain). Moreover, adopting a wordpiece tokenization shifts the focus from the word level to the subword level, making the models conceptually more complex and arguably less convenient in practice. For these reasons, we propose CharacterBERT, a new variant of BERT that drops the wordpiece system altogether and uses a Character-CNN module instead to represent entire words by consulting their characters. We show that this new model improves the performance of BERT on a variety of medical domain tasks while at the same time producing robust, word-level, and open-vocabulary representations.

2019

pdf bib
Embedding Strategies for Specialized Domains: Application to Clinical Entity Recognition
Hicham El Boukkouri | Olivier Ferret | Thomas Lavergne | Pierre Zweigenbaum
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Using pre-trained word embeddings in conjunction with Deep Learning models has become the “de facto” approach in Natural Language Processing (NLP). While this usually yields satisfactory results, off-the-shelf word embeddings tend to perform poorly on texts from specialized domains such as clinical reports. Moreover, training specialized word representations from scratch is often either impossible or ineffective due to the lack of large enough in-domain data. In this work, we focus on the clinical domain for which we study embedding strategies that rely on general-domain resources only. We show that by combining off-the-shelf contextual embeddings (ELMo) with static word2vec embeddings trained on a small in-domain corpus built from the task data, we manage to reach and sometimes outperform representations learned from a large corpus in the medical domain.