LARGEMED: A Resource for Identifying and Generating Paraphrases for French Medical Terms

Ioana Buhnila, Amalia Todirascu


Abstract
This article presents a method for extending an existing French corpus of paraphrases of medical terms, ANONYMOUS, with new data from Web archives created during the Covid-19 pandemic. The method semi-automatically detects new terms and the markers that introduce paraphrases in these Web archives; a manual annotation step then identifies the paraphrases and their lexical and semantic properties. The resulting extended corpus, LARGEMED, could be used for automatic simplification of medical texts for patients and their families. To automate data collection, we propose two experiments. The first uses the new LARGEMED dataset to train a binary classifier that detects new sentences likely to contain paraphrases. The second uses correct paraphrases to train a paraphrase generation model, adapting the T5 language model to the task with an adversarial algorithm.
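The semi-automatic detection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the marker list, the function name `find_candidates`, and the simple substring matching are all assumptions made for illustration, since the abstract does not specify the marker inventory or the matching procedure. The sketch only keeps sentences that contain both a known term and a paraphrase marker, leaving the decision about whether a paraphrase is really present to manual annotation.

```python
import re

# Hypothetical marker inventory, chosen for illustration only;
# the abstract does not list the markers actually used.
MARKERS = ["c'est-à-dire", "autrement dit", "à savoir", "également appelé"]

# One case-insensitive pattern matching any of the markers literally.
MARKER_RE = re.compile("|".join(re.escape(m) for m in MARKERS), re.IGNORECASE)


def find_candidates(sentences, terms):
    """Return (sentence, term, marker) triples for manual annotation.

    A sentence is a candidate when it contains at least one marker and
    at least one known term (plain case-insensitive substring match).
    """
    candidates = []
    for sentence in sentences:
        marker = MARKER_RE.search(sentence)
        if marker is None:
            continue  # no paraphrase marker: not a candidate
        lowered = sentence.lower()
        for term in terms:
            if term.lower() in lowered:
                candidates.append((sentence, term, marker.group(0).lower()))
    return candidates


if __name__ == "__main__":
    sentences = [
        "L'anosmie, c'est-à-dire la perte de l'odorat, est fréquente.",
        "Le vaccin est disponible.",
    ]
    for sentence, term, marker in find_candidates(sentences, ["anosmie", "vaccin"]):
        print(term, "|", marker, "|", sentence)
```

Candidates produced this way would still need the manual annotation step mentioned in the abstract, because a marker next to a term does not guarantee that a paraphrase follows.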
Anthology ID:
2024.determit-1.14
Volume:
Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context @ LREC-COLING 2024
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Giorgio Maria Di Nunzio, Federica Vezzani, Liana Ermakova, Hosein Azarbonyad, Jaap Kamps
Venues:
DeTermIt | WS
Publisher:
ELRA and ICCL
Pages:
141–151
URL:
https://aclanthology.org/2024.determit-1.14
Cite (ACL):
Ioana Buhnila and Amalia Todirascu. 2024. LARGEMED: A Resource for Identifying and Generating Paraphrases for French Medical Terms. In Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context @ LREC-COLING 2024, pages 141–151, Torino, Italia. ELRA and ICCL.
Cite (Informal):
LARGEMED: A Resource for Identifying and Generating Paraphrases for French Medical Terms (Buhnila & Todirascu, DeTermIt-WS 2024)
PDF:
https://preview.aclanthology.org/landing_page/2024.determit-1.14.pdf