Leonardo Campillos-Llanos
Also published as: Leonardo Campillos Llanos, Leonardo Campillos-Llanos
Patients cannot always fully understand medical documents, given the myriad of technical terms they contain. Automatic text simplification techniques can help, but they must guarantee that the content is conveyed rigorously, without creating wrong information. In this work, we tested: 1) lexicon-based simplification approaches, using a Spanish lexicon of technical and lay terms collected for this task (SimpMedLexSp); 2) deep-learning (DL) based methods, with BART-based and prompt-learning-based models; and 3) a combination of both techniques. As a test set, we used 5000 parallel (technical and lay) sentence pairs: 3800 manually aligned sentences from the CLARA-MeD corpus and 1200 sentences from clinical trials simplified by linguists. We conducted a quantitative evaluation with standard measures (BLEU, ROUGE and SARI) and a human evaluation, in which eleven subjects scored the simplification output of several methods. In our experiments, the lexicon improved the quantitative results when combined with the DL models. The sentences simplified using only the lexicon received the highest scores for semantic adequacy, although their fluency needs to be improved; the prompt-based method obtained similar ratings for semantic adequacy and for simplification. We make the models and the data available to reproduce our results.
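As an informal sketch of what the lexicon-based step amounts to, the snippet below performs longest-match-first substitution of technical terms by lay equivalents; the term-to-lay mapping is invented for illustration and is not the SimpMedLexSp resource itself.

# Minimal sketch of lexicon-based substitution; the mapping below is illustrative.
import re

tech_to_lay = {
    "disnea": "dificultad para respirar",
    "prurito": "picor",
    "edema pulmonar": "acumulación de líquido en los pulmones",
}

def simplify(sentence, lexicon):
    # Replace technical terms with lay variants, longest terms first,
    # so multi-word terms are matched before their sub-parts.
    for term in sorted(lexicon, key=len, reverse=True):
        pattern = r"\b" + re.escape(term) + r"\b"
        sentence = re.sub(pattern, lexicon[term], sentence, flags=re.IGNORECASE)
    return sentence

print(simplify("El paciente presenta disnea y edema pulmonar.", tech_to_lay))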
We report the work in progress of collecting MedLexSp, a unified medical lexicon for the Spanish language, featuring terms and inflected word forms mapped to Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs), semantic types and groups. First, we leveraged a list of term lemmas and forms from a previous project and mapped them to UMLS terms and CUIs. To enrich the lexicon, we used both domain corpora (e.g. Summaries of Product Characteristics and MedlinePlus) and natural language processing techniques such as string-distance methods or the generation of syntactic variants of multi-word terms. We also added term variants by mapping their CUIs to missing items available in the Spanish versions of standard thesauri (e.g. Medical Subject Headings and the World Health Organization Adverse Drug Reactions terminology). We enhanced the vocabulary coverage by gathering missing terms from resources such as the Anatomical Therapeutic Chemical (ATC) classification, the National Cancer Institute (NCI) Dictionary of Cancer Terms, OrphaData, or the Nomenclátor de Prescripción for drug names. Part-of-speech information is being included in the lexicon, and the current version amounts to 76,454 lemmas and 203,043 inflected forms (including conjugated verbs, and number and gender variants), corresponding to 30,647 UMLS CUIs. MedLexSp is distributed freely for research purposes.
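A minimal sketch of the string-distance enrichment step, assuming the UMLS Spanish terms and the corpus candidates are available as plain lists; the terms shown are illustrative, and the standard-library matcher stands in for whichever string-distance method the lexicon actually uses.

# Link corpus candidates to near-identical UMLS terms by string similarity.
import difflib

umls_terms = ["cefalea", "hipertensión arterial", "insuficiencia cardíaca"]
corpus_candidates = ["cefaleas", "hipertension arterial", "fiebre"]

for candidate in corpus_candidates:
    matches = difflib.get_close_matches(candidate, umls_terms, n=1, cutoff=0.85)
    if matches:
        print(f"{candidate!r} -> possible variant of {matches[0]!r}")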
We present the work in progress of automating the classification of doctor-patient questions in the context of a simulated consultation with a virtual patient. We classify questions according to the computational strategy (rule-based or other) needed to look up data in the clinical record. We compare ‘traditional’ machine learning methods (Gaussian and Multinomial Naive Bayes, and Support Vector Machines) and a neural network classifier (FastText). We obtained the best results with the SVM using semantic annotations, whereas the neural classifier achieved promising results without them.
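A minimal sketch of the kind of classifier comparison described above, assuming each question is labelled with the look-up strategy it requires; the questions and labels below are toy placeholders, not the study data, and plain TF-IDF features stand in for the semantic annotations.

# Compare a Naive Bayes and an SVM text classifier on toy doctor questions.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

questions = ["¿Tiene usted fiebre?", "¿Desde cuándo le duele la cabeza?",
             "¿Toma alguna medicación?", "¿El dolor aparece al caminar?"]
labels = ["rule", "other", "rule", "other"]   # hypothetical strategy classes

for name, clf in [("MultinomialNB", MultinomialNB()), ("LinearSVC", LinearSVC())]:
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    pipe.fit(questions, labels)
    print(name, pipe.predict(["¿Le han operado alguna vez?"]))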
While measuring the readability of texts has been a long-standing research topic, assessing the technicality of terms has only been addressed more recently, and mostly for the English language. In this paper, we train a learning-to-rank model to determine a specialization degree for each term in a given list. Since no training data exist for this task in French, we train our system with non-lexical features on English data, namely the Consumer Health Vocabulary, and then apply it to French. The features include the likelihood ratio of the term based on specialized and lay language models, and tests for the presence of morphologically complex words. The evaluation of this approach is conducted on 134 terms from the UMLS Metathesaurus and 868 terms from the Eugloss thesaurus. The Normalized Discounted Cumulative Gain (NDCG) obtained by our system is over 0.8 on both test sets. Moreover, thanks to the learning-to-rank approach, adding morphological features to the language-model features improves the results on the Eugloss thesaurus.
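The likelihood-ratio feature can be sketched as follows, assuming add-one-smoothed unigram language models estimated from a specialized and a lay corpus; the toy word counts are invented, and the smoothing choice is an assumption rather than the paper's exact setup.

# Score a term by its log-likelihood ratio under specialized vs. lay unigram models.
import math
from collections import Counter

specialized = Counter("the patient presented with bilateral pulmonary edema".split())
lay = Counter("the patient had fluid in both lungs".split())

def log_prob(term, counts, vocab_size):
    # Add-one smoothed unigram log-probability of the term's words.
    total = sum(counts.values())
    return sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in term.split())

vocab = len(set(specialized) | set(lay))
term = "pulmonary edema"
ratio = log_prob(term, specialized, vocab) - log_prob(term, lay, vocab)
print(f"log likelihood ratio for {term!r}: {ratio:.2f}")  # positive values suggest a technical term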
We introduce a dialogue task between a virtual patient and a doctor in which the dialogue system, playing the patient part in a simulated consultation, must reconcile a specialized level, to understand what the doctor says, and a lay level, to output realistic patient-language utterances. This increases the challenges in the analysis and generation phases of the dialogue. This paper proposes methods to manage linguistic and terminological variation in that situation and illustrates how they help produce realistic dialogues. Our system makes use of lexical resources for processing synonyms, inflectional and derivational variants, or pronoun/verb agreement. In addition, specialized knowledge is used for processing medical roots and affixes, ontological relations and concept mapping, and for generating lay variants of terms according to the patient’s non-expert discourse. We also report the results of a first evaluation carried out by 11 users interacting with the system. We evaluated the non-contextual analysis module, which supports the Spoken Language Understanding step. The annotation of task-domain entities obtained 91.8% precision, 82.5% recall, 86.9% F-measure, a 19.0% slot error rate, and a 32.9% sentence error rate.
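For reference, the entity-annotation scores reported above can be computed from per-slot counts with the usual definitions; a minimal sketch, with placeholder counts rather than the study's actual confusion counts.

# Compute precision, recall, F-measure and slot error rate from slot counts.
def slot_scores(correct, substitutions, deletions, insertions):
    reference = correct + substitutions + deletions    # slots in the gold annotation
    hypothesis = correct + substitutions + insertions  # slots produced by the system
    precision = correct / hypothesis
    recall = correct / reference
    f_measure = 2 * precision * recall / (precision + recall)
    slot_error_rate = (substitutions + deletions + insertions) / reference
    return precision, recall, f_measure, slot_error_rate

print(slot_scores(correct=80, substitutions=5, deletions=12, insertions=2))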
This article summarizes the evaluation process of an interface under development for consulting an oral corpus of learners of Spanish as a Foreign Language. The databank comprises 40 interviews with students with more than nine different mother tongues, collected for Error Analysis. XML mark-up is used to code the information about the learners and their errors (with an explanation), and the search tool makes it possible to look up these errors and to listen to the utterances in which they appear. The formative evaluation was performed to improve the interface during the design stage by means of a questionnaire which addressed issues related to the teachers’ beliefs about languages, their opinion about the Error Analysis methodology, and specific points about the interface design and usability. The results unveiled some deficiencies of the current prototype, as well as the interests of the teaching professionals, which should be considered to bridge the gap between technology development and its pedagogical applications.
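A hypothetical sketch of how such XML error mark-up could be queried; the element and attribute names below are illustrative assumptions, not the corpus's actual annotation schema.

# Parse an (invented) error-annotated utterance and list the coded errors.
import xml.etree.ElementTree as ET

sample = """
<utterance speaker="learner" l1="Japanese">
  Ayer <error type="verb-tense" correction="fui" explanation="pretérito indefinido">voy</error> al cine.
</utterance>
"""

root = ET.fromstring(sample)
for err in root.iter("error"):
    print(err.get("type"), "->", err.text, "| suggested:", err.get("correction"))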
This paper presents a method for designing, compiling and annotating corpora intended for language learners. In particular, we focus on spoken corpora to be used as complementary material in the classroom as well as in examinations. We describe the three corpora (Spanish, Chinese and Japanese) compiled by the Laboratorio de Lingüística Informática at the Autonomous University of Madrid (LLI-UAM). A web-based concordance tool has been used to search for examples in the corpora, providing the text along with the corresponding audio. Teaching materials based on the corpora, consisting of the texts, the audio files and exercises on them, are currently under development.
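A minimal sketch of the keyword-in-context look-up a concordance tool performs, pairing each hit with its audio source; the transcript lines and audio file names below are invented examples, not the LLI-UAM data.

# Keyword-in-context (KWIC) search over (transcript, audio file) pairs.
corpus = [
    ("entonces fuimos a la playa con mis amigos", "interview_03.mp3"),
    ("no me gusta la playa porque hace mucho calor", "interview_07.mp3"),
]

def concordance(keyword, corpus, window=3):
    for text, audio in corpus:
        tokens = text.split()
        for i, tok in enumerate(tokens):
            if tok == keyword:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                print(f"{left:>30} [{keyword}] {right:<30} ({audio})")

concordance("playa", corpus)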