Lucas Sterckx
Recent work has shown that large language models (LLMs) are capable of processing biomedical data. However, their zero-shot deployment in hospitals raises numerous challenges. The models are often too expensive for local inference and fine-tuning; their multilingual capability lags behind their English performance; and their pre-training datasets, often drawn from biomedical publications, are too generic for optimal performance given the complexity of the clinical scenarios found in healthcare data. We address these and other challenges in a real-world multilingual use case through the development of an end-to-end concept normalization pipeline. Its main objective is to convert information from unstructured (multilingual) health records into codified ontologies, enabling concept detection within a patient's medical history. In this paper, we quantitatively demonstrate the importance of real-world, domain-specific data for large-scale clinical applications.
This paper describes IDLab’s text classification systems submitted to Task A of the CLPsych 2019 shared task. The aim of this shared task was to develop automated systems that predict the degree of suicide risk of people based on their posts on Reddit. Bag-of-words features, emotion features, and post-level predictions are used to derive user-level predictions. Linear models and ensembles of these models are used to predict final scores. We find that predicting fine-grained risk levels is much more difficult than flagging potentially at-risk users. Furthermore, we do not find clear added value from building richer ensembles compared to simple baselines, given the available training data and the nature of the prediction task.
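The two-stage idea in the abstract (post-level predictions aggregated into user-level predictions, then a weighted-average ensemble of models) can be sketched as follows. This is a minimal illustration, not the actual IDLab system: the mean aggregation, the user IDs, and the weights are all hypothetical.

```python
from collections import defaultdict

def user_level_scores(post_predictions):
    """Aggregate post-level risk scores into one score per user.
    A simple mean is assumed here; the paper's aggregation may differ."""
    by_user = defaultdict(list)
    for user_id, score in post_predictions:
        by_user[user_id].append(score)
    return {u: sum(s) / len(s) for u, s in by_user.items()}

def ensemble(score_dicts, weights):
    """Weighted average of several models' user-level scores."""
    users = score_dicts[0].keys()
    total = sum(weights)
    return {
        u: sum(w * d[u] for d, w in zip(score_dicts, weights)) / total
        for u in users
    }

# Hypothetical post-level predictions from two models:
model_a = user_level_scores([("u1", 0.2), ("u1", 0.4), ("u2", 0.9)])
model_b = user_level_scores([("u1", 0.6), ("u2", 0.7)])
combined = ensemble([model_a, model_b], weights=[2.0, 1.0])
```

In this sketch each model first produces one score per user, and the ensemble simply reweights the per-user scores; flagging at-risk users would then amount to thresholding the combined score.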
Short text clustering is a challenging problem when adopting traditional bag-of-words or TF-IDF representations, since these lead to sparse vector representations of the short texts. Low-dimensional continuous representations or embeddings can counter that sparseness problem: their high representational power is exploited in deep clustering algorithms. While deep clustering has been studied extensively in computer vision, relatively little work has focused on NLP. The method we propose learns discriminative features from both an autoencoder and a sentence embedding, and then uses assignments from a clustering algorithm as supervision to update the weights of the encoder network. Experiments on three short text datasets empirically validate the effectiveness of our method.
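The self-training loop described above, in which cluster assignments serve as supervision for the encoder, can be sketched in a heavily simplified form. A linear map `W` stands in for the paper's autoencoder/sentence-embedding network, and the toy data, initialization, and hyperparameters are illustrative assumptions, not the actual method.

```python
import numpy as np

def kmeans_assign(Z, C):
    # Nearest-centroid assignment under squared Euclidean distance.
    return ((Z[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)

def self_train(X, k=2, dim=2, steps=30, lr=0.05, seed=0):
    """Toy sketch of cluster-assignment self-training: a linear map W
    plays the role of the encoder; k-means assignments over the encoded
    points are used as pseudo-labels to update W."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], dim))
    Z = X @ W
    # Simple deterministic init: first and last encoded points.
    C = np.stack([Z[0], Z[-1]]) if k == 2 else Z[:k].copy()
    for _ in range(steps):
        Z = X @ W
        labels = kmeans_assign(Z, C)
        # Recompute centroids from the current assignments.
        C = np.stack([Z[labels == j].mean(0) if (labels == j).any() else C[j]
                      for j in range(k)])
        # Treat assignments as supervision: gradient step that pulls
        # each encoded point toward its assigned centroid.
        W -= lr * (X.T @ (Z - C[labels])) / len(X)
    return kmeans_assign(X @ W, C)

# Hypothetical toy data: two tight, well-separated groups in 4-D.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (10, 4)),
               rng.normal(5.0, 0.1, (10, 4))])
labels = self_train(X)
```

The gradient step minimizes the within-cluster distance of the encodings, which is the sense in which the clustering output supervises the encoder update.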
This paper describes the IDLab system submitted to Task A of the CLPsych 2018 shared task. The goal of this task is to predict the psychological health of children based on the language used in hand-written essays and on socio-demographic control variables. Our entry uses word- and character-based features as well as lexicon-based features and features derived from the essays, such as the quality of the language. We apply linear models, gradient boosting, and neural-network-based regressors (feed-forward networks, CNNs, and RNNs) to predict scores. We then build ensembles of our best-performing models using a weighted average.
Comprehending lyrics, as found in songs and poems, can pose a challenge to human and machine readers alike. This motivates the need for systems that can understand the ambiguity and jargon found in such creative texts, and provide commentary to aid readers in reaching the correct interpretation. We introduce the task of automated lyric annotation (ALA). Like text simplification, a goal of ALA is to rephrase the original text in a more easily understandable manner. However, in ALA the system must often include additional information to clarify niche terminology and abstract concepts. To stimulate research on this task, we release a large collection of crowdsourced annotations for song lyrics. We analyze the performance of translation and retrieval models on this task, measuring performance with both automated and human evaluation. We find that each model captures a unique type of information important to the task.