Ismaïl Harrando

Also published as: Ismail Harrando


2024

pdf
Claire: Large Language Models for Spontaneous French Dialogue
Jérôme Louradour | Julie Hunter | Ismaïl Harrando | Guokan Shang | Virgile Rennard | Jean-Pierre Lorré
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position

Nous présentons la famille de modèles Claire, une collection de modèles de langage conçus pour améliorer les tâches nécessitant la compréhension des conversations parlées, tel que le résumé de réunions. Nos modèles résultent de la poursuite du pré-entraînement de deux modèles de base exclusivement sur des transcriptions de conversations et des pièces de théâtre. Aussi nous nous concentrons sur les données en français afin de contrebalancer l’accent mis sur l’anglais dans la plupart des corpus d’apprentissage. Cet article décrit le corpus utilisé, l’entraînement des modèles ainsi que leur évaluation. Les modèles, les données et le code qui en résultent sont publiés sous licences ouvertes, et partagés sur Hugging Face et GitHub.

2021

pdf
Apples to Apples: A Systematic Evaluation of Topic Models
Ismail Harrando | Pasquale Lisena | Raphael Troncy
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

From statistical to neural models, a wide variety of topic modelling algorithms have been proposed in the literature. However, because of the diversity of datasets and metrics, there have not been many efforts to systematically compare their performance on the same benchmarks and under the same conditions. In this paper, we present a selection of 9 topic modelling techniques from the state of the art reflecting a diversity of approaches to the task, an overview of the different metrics used to compare their performance, and the challenges of conducting such a comparison. We empirically evaluate the performance of these models on different settings reflecting a variety of real-life conditions in terms of dataset size, number of topics, and distribution of topics, following identical preprocessing and evaluation processes. Using both metrics that rely on the intrinsic characteristics of the dataset (different coherence metrics), as well as external knowledge (word embeddings and ground-truth topic labels), our experiments reveal several shortcomings regarding the common practices in topic models evaluation.

2020

pdf
TOMODAPI: A Topic Modeling API to Train, Use and Compare Topic Models
Pasquale Lisena | Ismail Harrando | Oussama Kandakji | Raphael Troncy
Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)

From LDA to neural models, different topic modeling approaches have been proposed in the literature. However, their suitability and performance is not easy to compare, particularly when the algorithms are being used in the wild on heterogeneous datasets. In this paper, we introduce ToModAPI (TOpic MOdeling API), a wrapper library to easily train, evaluate and infer using different topic modeling algorithms through a unified interface. The library is extensible and can be used in Python environments or through a Web API.