Marceau Hernandez
2025
Forbidden FRUIT is the Sweetest: An Annotated Tweets Corpus for French Unfrozen Idioms Identification
Julien Bezançon | Gaël Lejeune | Antoine Gautier | Marceau Hernandez | Félix Alié
Proceedings of the 19th Linguistic Annotation Workshop (LAW-XIX-2025)
Multiword expressions (MWEs) are a key area of interest in NLP, studied across various languages and inspiring the creation of dedicated datasets and shared tasks such as PARSEME. Puns in multiword expressions (PMWEs) can be described as MWEs that have been “unfrozen” to acquire a new meaning or create a wordplay. Unlike MWEs, they have received little attention in NLP, mainly due to the lack of resources available for their study. In this context, we introduce the French Unfrozen Idioms in Tweets (FRUIT) corpus, a dataset of tweets spanning three years and comprising 60,617 tweets containing both MWEs and PMWE candidates. We first describe the process of constructing this corpus, followed by an overview of the manual annotation task performed by three experts on 600 tweets, achieving a maximum α score of 0.83. Insights from this manual annotation process were then used to develop a Game With A Purpose (GWAP) to annotate more tweets from the FRUIT corpus. This GWAP aims to enhance players’ understanding of MWEs and PMWEs. To date, 13 players have made 2,206 annotations on 931 tweets, reaching an α score of 0.70. In total, 1,531 tweets from the FRUIT corpus have been annotated.
2024
Trois méthodes Sorbonne et SNCF pour la résolution de QCM (DEFT2024)
Tom Rousseau | Marceau Hernandez | Iglika Stoupak | Angelo Mendoca-Manhoso | Andrea Blivet | Chang Liu | Toufik Boubehbiz | Corina Chuteaux | Gaël Guibon | Gaël Lejeune | Luce Lefeuvre
Actes du Défi Fouille de Textes@TALN 2024
This paper describes the Sorbonne-SNCF team's participation in the Défi Fouille de Textes 2024 (DEFT 2024), which focused on the automatic correction of multiple-choice questions (MCQs) in French. The corpus, made up of pharmacology questions, was reformulated as assertions. We used advanced natural language processing techniques to process the answers. Three main approaches, NachosLLM, TTGV byfusion, and TTGV ollama multilabel, are presented, with respective EMR scores of 2.94, 4.19, and 1.68. The results show differing levels of accuracy and highlight the limitations of multi-label approaches. Suggested improvements include adjusting the language models and the classification criteria.
Co-authors
- Gaël Lejeune 2
- Félix Alié 1
- Julien Bezançon 1
- Andréa Blivet 1
- Toufik Boubehbiz 1