CLISTER : A Corpus for Semantic Textual Similarity in French Clinical Narratives

Nicolas Hiebel, Olivier Ferret, Karën Fort, Aurélie Névéol


Abstract
Modern Natural Language Processing relies on the availability of annotated corpora for training and evaluating models. Such resources are scarce, especially for specialized domains in languages other than English. In particular, there are very few resources for semantic similarity in the clinical domain in French, although such resources can be useful for many biomedical natural language processing applications, including text generation. We introduce a definition of similarity that is guided by clinical facts and apply it to the development of a new French corpus of 1,000 sentence pairs manually annotated with similarity scores. This new sentence similarity corpus is made freely available to the community. We further evaluate the corpus through experiments on automatic similarity measurement. We show that a sentence embedding model can capture similarity with state-of-the-art performance on the DEFT STS shared task evaluation data set (Spearman=0.8343). We also show that the corpus is complementary to DEFT STS.
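The evaluation described in the abstract scores a sentence embedding model by the Spearman correlation between its predicted similarities and the manual annotations. The following is a minimal sketch of that kind of measurement, not the authors' code: the toy vectors, the gold scores, the 0 to 5 scale, and the use of cosine similarity as the predicted score are all assumptions made for illustration, and the helper names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical sentence-pair embeddings (toy 2-d vectors, not real model output).
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([1.0, 0.0], [1.0, 1.0]),
    ([1.0, 0.0], [0.0, 1.0]),
]
# Hypothetical gold similarity scores; a 0-5 scale is assumed here.
gold = [5.0, 3.0, 0.0]

predicted = [cosine(u, v) for u, v in pairs]
score = spearman(predicted, gold)
print(f"Spearman = {score:.4f}")
```

Because Spearman correlation depends only on rank order, the predicted scores do not need to be on the same scale as the annotations; a model is rewarded for ordering the sentence pairs the way the annotators did.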
Anthology ID:
2022.lrec-1.459
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
4306–4315
URL:
https://aclanthology.org/2022.lrec-1.459
Cite (ACL):
Nicolas Hiebel, Olivier Ferret, Karën Fort, and Aurélie Névéol. 2022. CLISTER : A Corpus for Semantic Textual Similarity in French Clinical Narratives. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4306–4315, Marseille, France. European Language Resources Association.
Cite (Informal):
CLISTER : A Corpus for Semantic Textual Similarity in French Clinical Narratives (Hiebel et al., LREC 2022)
PDF:
https://preview.aclanthology.org/nschneid-patch-4/2022.lrec-1.459.pdf
Data
BIOSSES