Diego Molla
Also published as:
Diego Mollá-Aliod,
Diego Mollá Aliod,
Diego Molla-Aliod,
Diego Mollá,
Diego Mollá Aliod
For a natural language dialogue system to engage in a goal-oriented conversation, it must elicit information from a user. Research on large language models (LLMs) often focuses on aligning them with user goals. Consequently, studies show these models can serve as chat assistants and answer users' questions. However, their information-elicitation abilities remain understudied. This work evaluates these abilities in goal-oriented dialogues for optimisation modelling. We compare two GPT-4-based settings that generate conversations between a modeller and a user over NL4Opt, a collection of simple optimisation problem descriptions, and analyse the modeller's information elicitation. In the first, the modeller LLM has access to the problem details and asks targeted questions, simulating an informed modeller. In the second, the LLM infers the problem details through interaction: asking clarifying questions, interpreting responses, and gradually constructing an understanding of the task. This comparison assesses whether LLMs can elicit information and navigate problem discovery without prior knowledge of the problem. We compare modeller turns in both settings using human raters across criteria at the whole-dialogue and turn levels. Results show that a non-informed LLM can elicit information nearly as well as an informed one, producing high-quality dialogues. In particular, the success levels of both agents in the system without modeller access to the problem details are comparable to those in a system with full access. The dialogues rate well on coherence, and a post-annotation error analysis identified error types that are useful for improving dialogue quality. GPT-4's capability to elicit information in optimisation modelling dialogues suggests that newer LLMs may be even more capable.
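The two settings can be illustrated with a minimal two-agent generation loop. This is only a sketch under our own assumptions: the prompts, the fixed turn budget, and the helper names (run_dialogue, chat) are illustrative rather than taken from the paper, and the code presumes the OpenAI Python client.

    from openai import OpenAI

    client = OpenAI()

    MODELLER_INFORMED = ("You are an optimisation modeller. You know the problem "
                         "below; ask the user targeted questions to confirm every "
                         "detail.\n\nPROBLEM: {problem}")
    MODELLER_BLIND = ("You are an optimisation modeller. You know nothing about "
                      "the user's problem; ask clarifying questions until you can "
                      "state the full model.")
    USER = ("You want help with the optimisation problem below. Answer the "
            "modeller's questions.\n\nPROBLEM: {problem}")

    def chat(model, system, history, speaker):
        # Re-cast the shared history from this speaker's point of view.
        messages = [{"role": "system", "content": system}]
        for who, text in history:
            role = "assistant" if who == speaker else "user"
            messages.append({"role": role, "content": text})
        resp = client.chat.completions.create(model=model, messages=messages)
        return resp.choices[0].message.content

    def run_dialogue(problem, informed, turns=6, model="gpt-4"):
        modeller_sys = (MODELLER_INFORMED.format(problem=problem)
                        if informed else MODELLER_BLIND)
        history = []
        for _ in range(turns):
            q = chat(model, modeller_sys, history, speaker="modeller")
            history.append(("modeller", q))
            a = chat(model, USER.format(problem=problem), history, speaker="user")
            history.append(("user", a))
        return history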
The ALTA shared tasks have been running annually since 2010. In 2025, the task focuses on the normalisation of Adverse Drug Events (ADEs) found in forum posts to their corresponding standard terms specified by the Medical Dictionary for Regulatory Activities (MedDRA). MedDRA is a comprehensive ontology of ADEs, which contains more ADE descriptions than those mentioned in the available training dataset. This makes the task more challenging than a straightforward supervised classification. We present the task, the evaluation criteria, and the results of the systems participating in the shared task.
Detecting emotions across different languages is challenging due to the varied and culturally nuanced ways in which emotions are expressed. The SemEval 2025 Task 11: Bridging the Gap in Text-Based Emotion Detection shared task was organised to investigate emotion recognition across different languages. The goal of the task is to implement an emotion recogniser that can identify the basic emotional states that general third-party observers would attribute to an author based on their written text snippet, along with the intensity of those emotions. We report our investigation of various task-adaptation strategies for LLMs in emotion recognition. We show that the most effective method for this task is to fine-tune a pre-trained multilingual LLM for each language.
The ALTA shared tasks have been running annually since 2010. In 2024, the purpose of the task is to detect machine-generated text in a hybrid setting where the text may contain portions of human-written text and portions of machine-generated text. In this paper, we present the task, the evaluation criteria, and the results of the systems participating in the shared task.
Biomedical question-answering systems remain popular with biomedical experts who interact with the literature to answer their medical questions. However, these systems are difficult to evaluate in the absence of costly human experts. Therefore, automatic evaluation metrics are often used in this space. Traditional automatic metrics such as ROUGE or BLEU, which rely on token overlap, have shown a low correlation with humans. We present a study that uses large language models (LLMs) to automatically evaluate systems from an international challenge on biomedical semantic indexing and question answering, called BioASQ. We measure the agreement of LLM-produced scores against human judgements. We show that, with basic prompting techniques, LLM scores correlate with human judgements at a level similar to lexical methods. However, by aggregating evaluators with LLMs or by fine-tuning, we find that our methods outperform the baselines by a large margin, achieving Spearman correlations of 0.501 and 0.511, respectively.
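The agreement measurement reduces to a rank correlation between LLM scores and human judgements, and aggregation simply pools several evaluators before correlating. A minimal sketch with illustrative numbers (not BioASQ data) using scipy:

    import numpy as np
    from scipy.stats import spearmanr

    # Human judgements and the scores of three hypothetical LLM
    # evaluators for five answers (illustrative values only).
    human = np.array([4.0, 2.5, 3.0, 5.0, 1.5])
    llm_scores = np.array([[3.8, 2.0, 3.5, 4.9, 2.0],
                           [4.2, 3.0, 2.8, 4.5, 1.0],
                           [3.5, 2.2, 3.1, 5.0, 1.8]])

    # Aggregating evaluators: average the per-answer scores before
    # correlating with the human judgements.
    aggregated = llm_scores.mean(axis=0)
    rho, _ = spearmanr(human, aggregated)
    print(f"Spearman rho = {rho:.3f}")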
Monitoring and predicting the expression of suicidal risk in individuals' social media posts is a central focus in clinical NLP. Yet, existing approaches frequently lack the explainability component necessary for extracting evidence related to an individual's mental health state. We describe the CSIRO Data61 team's evidence extraction system submitted to the CLPsych 2024 shared task. The task aims to investigate the zero-shot capabilities of open-source LLMs in extracting evidence regarding an individual's assigned suicide risk level from social media discourse. The results are assessed against ground-truth evidence annotated by psychological experts, achieving a recall-oriented BERTScore of 0.919. Our findings suggest that LLMs are a feasible means of extracting information that supports the evaluation of suicidal risk in social media discourse. Opportunities for refinement exist, notably in crafting concise and effective instructions to guide the extraction process.
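A recall-oriented BERTScore rewards extracted spans that cover the reference evidence. A minimal sketch with the bert-score package and made-up strings (not CLPsych data):

    from bert_score import score

    # Candidate evidence extracted by the LLM vs. the expert-annotated
    # ground truth (illustrative strings only).
    candidates = ["I can't see a way out of this anymore"]
    references = ["no way out of this anymore"]

    # Recall-oriented evaluation: R measures how much of the reference
    # evidence is covered by the extracted span.
    P, R, F1 = score(candidates, references, lang="en")
    print(f"recall-oriented BERTScore = {R.mean().item():.3f}")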
Determining causal directions in sentences plays a critical role in understanding a cause-and-effect relationship between entities. In this paper, we show empirically that word occurrences from several Internet domains resemble the characteristics of causal directions. Our research contributes to the knowledge of the underlying data generation process behind causal directions. We propose a two-phase method: (1) a Bayesian framework, which generates synthetic data from posteriors by incorporating word occurrences from the Internet domains; and (2) a pre-trained BERT model, which uses the context-based semantics of words to perform classification. The proposed method improves performance on the Cause-Effect relations of the SemEval-2010 dataset when compared with random guessing.
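Phase 1 can be pictured as posterior sampling over a causal-direction probability estimated from domain-specific occurrence counts. The counts and the Beta-Binomial choice below are our own illustration, not the paper's exact model:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical counts: times "smoke" precedes "cancer" vs. follows it,
    # collected from several Internet domains (illustrative numbers).
    counts_forward = np.array([40, 22, 35])
    counts_backward = np.array([10, 8, 12])

    # Beta-Binomial posterior over the probability that the causal
    # direction runs "smoke -> cancer", with a flat Beta(1, 1) prior.
    alpha = 1 + counts_forward.sum()
    beta = 1 + counts_backward.sum()

    # Phase 1: draw synthetic direction labels from the posterior predictive.
    theta = rng.beta(alpha, beta, size=1000)
    synthetic_labels = rng.binomial(1, theta)  # 1 = cause precedes effect

    print(f"posterior mean P(forward) = {alpha / (alpha + beta):.3f}")
    # Phase 2 (not shown): fine-tune a pre-trained BERT classifier on
    # sentences paired with these synthetic direction labels.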
The ALTA shared tasks have been running annually since 2010. In 2023, the purpose of the task is to build automatic detection systems that can discriminate between human-written text and synthetic text generated by Large Language Models (LLMs). In this paper we present the task, the evaluation criteria, and the results of the systems participating in the shared task.
Linear programming (LP) problems are pervasive in real-life applications. However, despite their apparent simplicity, an untrained user may find it difficult to determine the linear model of their specific problem. We envisage the creation of a goal-oriented conversational agent that will engage in conversation with the user to elicit all information required so that a subsequent agent can generate the linear model. In this paper, we present an approach for the generation of sample dialogues that can be used to develop and train such a conversational agent. Using prompt engineering, we develop two agents that “talk” to each other, one acting as the conversational agent, and the other acting as the user. Using a set of text descriptions of linear problems from NL4Opt, available only to the user, the agent and the user engage in conversation until the agent has retrieved all key information from the original problem description. We also propose an extrinsic evaluation of the dialogues by assessing how well the summaries generated by the dialogues match the original problem descriptions. We conduct human and automatic evaluations, including an evaluation approach that uses GPT-4 to mimic the human evaluation metrics. The evaluation results show an overall good quality of the dialogues, though research is still needed to improve the quality of the GPT-4 evaluation metrics. The resulting dialogues, including the human annotations of a subset, are available to the research community. The conversational agent used for the generation of the dialogues can be used as a baseline.
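The extrinsic evaluation amounts to scoring each dialogue-derived summary against the original problem description. ROUGE-L in the sketch below is our stand-in for the automatic metrics; the paper also uses human ratings and GPT-4 judgements that mimic them:

    from rouge_score import rouge_scorer

    def summary_fidelity(original_description, dialogue_summary):
        # How well does the summary generated from a dialogue match the
        # original NL4Opt problem description?
        scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
        return scorer.score(original_description, dialogue_summary)["rougeL"]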
A simple machine learning model of pluralisation as a linear regression problem minimising a p-adic metric substantially outperforms even the most robust of Euclidean-space regressors on languages in the Indo-European, Austronesian, Trans-New Guinea, Sino-Tibetan, Nilo-Saharan, Oto-Manguean and Atlantic-Congo language families. There is insufficient evidence to support modelling distinct noun declensions as a p-adic neighbourhood even in Indo-European languages.
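The key ingredient is the p-adic absolute value, under which two integers are close when their difference is divisible by a high power of p. A minimal sketch (encoding word forms as base-p integers is our assumption about how such a regressor would be set up):

    def p_adic_valuation(n: int, p: int) -> float:
        # Largest k such that p**k divides n; v_p(0) is treated as infinity.
        if n == 0:
            return float("inf")
        k = 0
        while n % p == 0:
            n //= p
            k += 1
        return k

    def p_adic_abs(n: int, p: int) -> float:
        # p-adic absolute value |n|_p = p**(-v_p(n)).
        v = p_adic_valuation(n, p)
        return 0.0 if v == float("inf") else p ** (-v)

    # Numbers sharing a long common suffix in base p are close:
    # |12 - 4|_2 = |8|_2 = 1/8, while |12 - 11|_2 = |1|_2 = 1.
    # A pluralisation model can encode word endings as base-p integers and
    # fit a linear map minimising total p-adic loss instead of squared error.
    print(p_adic_abs(12 - 4, 2), p_adic_abs(12 - 11, 2))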
The ALTA shared tasks have been running annually since 2010. The 2022 shared task is a re-visit of the 2012 ALTA shared task. The purpose of this task is to classify sentences of medical publications using the PIBOSO taxonomy. This is a multi-label classification task which can help medical researchers and practitioners conduct Evidence Based Medicine (EBM). In this paper we present the task, the evaluation criteria, and the results of the systems participating in the shared task.
The LEAFTOP (language extracted automatically from thousands of passages) dataset consists of nouns that appear in multiple places in the four gospels of the New Testament. We use a naive approach, probabilistic inference, to identify likely translations in 1480 other languages. We evaluate this process and find that it provides lexiconaries with accuracy from 42% (Korafe) to 99% (Runyankole), averaging 72% correct across evaluated languages. The process translates up to 161 distinct lemmas from Koine Greek (average 159). We identify nouns which appear to be easy and hard to translate, language families where this technique works, and possible future improvements and extensions. The claims to novelty are: the use of a Koine Greek New Testament as the source language; the use of a fully-annotated, manually-created grammatical parse of the source text; a custom scraper for texts in the target languages; a new metric for language similarity; and a novel strategy for evaluation on low-resource languages.
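The probabilistic inference can be pictured as picking, for each Greek lemma, the target-language token that co-occurs with it in the largest fraction of passages. The sketch below uses toy strings, not real LEAFTOP data:

    from collections import Counter
    from itertools import product

    # Toy parallel data: for each passage, the Greek lemmas present and
    # the tokens of the same passage in a target language.
    passages = [
        ({"artos"}, {"bread", "give", "us"}),
        ({"artos", "hydor"}, {"bread", "water", "and"}),
        ({"hydor"}, {"water", "living"}),
    ]

    cooc = Counter()
    lemma_freq = Counter()
    for lemmas, tokens in passages:
        for lemma, token in product(lemmas, tokens):
            cooc[(lemma, token)] += 1
        lemma_freq.update(lemmas)

    # Naive inference: the likeliest translation of a lemma is the target
    # token co-occurring with it in the largest fraction of its passages.
    for lemma in lemma_freq:
        best = max((t for (l, t) in cooc if l == lemma),
                   key=lambda t: cooc[(lemma, t)] / lemma_freq[lemma])
        print(lemma, "->", best)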
The 2021 ALTA shared task is the 12th instance of a series of shared tasks organised by ALTA since 2010. Motivated by the advances in machine learning in the last 10 years, this year's task is a re-visit of the 2011 ALTA shared task. Set within the framework of Evidence Based Medicine (EBM), the goal is to predict the quality of the clinical evidence present in a set of documents. This year's participant results did not improve over those of the participants from 2011.
Vent is a specialised iOS/Android social media platform with the stated goal of encouraging people to post about their feelings and explicitly label them. In this paper, we study a snapshot of more than 100 million messages obtained from the developers of Vent, together with the labels assigned by the authors of the messages. We establish the quality of the self-annotated data by conducting a qualitative analysis, a vocabulary-based analysis, and by training and testing an emotion classifier. We conclude that the self-annotated labels of our corpus are indeed indicative of the emotional contents expressed in the text and thus can support more detailed analyses of emotion expression on social media, such as emotion trajectories and the factors influencing them.
Free text data from social media is now widely used in natural language processing research, and one of the most common machine learning tasks performed on this data is classification. Generally speaking, the performance of supervised classification algorithms on social media datasets is lower than on texts from other sources, but recently proposed transformer-based models have considerably improved upon legacy state-of-the-art systems. Currently, no study compares the performance of different variants of transformer-based models on a wide range of social media text classification datasets. In this paper, we benchmark the performance of transformer-based pre-trained models on 25 social media text classification datasets, 6 of which are health-related. We compare three pre-trained language models, RoBERTa-base, BERTweet and ClinicalBioBERT, in terms of classification accuracy. Our experiments show that RoBERTa-base and BERTweet perform comparably on most datasets, and considerably better than ClinicalBioBERT, even on the health-related datasets.
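The benchmark reduces to fine-tuning each checkpoint on each dataset with an identical recipe. A minimal sketch with Hugging Face transformers; the exact checkpoint names, the hyperparameters, and the assumption that train_ds and eval_ds are already tokenised are ours:

    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    # Checkpoints plausibly matching the three compared models (the exact
    # checkpoints used in the paper are an assumption on our part).
    CHECKPOINTS = {
        "RoBERTa-base": "roberta-base",
        "BERTweet": "vinai/bertweet-base",
        "ClinicalBioBERT": "emilyalsentzer/Bio_ClinicalBERT",
    }

    def fine_tune(name, train_ds, eval_ds, num_labels):
        # Fine-tune one pre-trained encoder on one classification dataset;
        # train_ds and eval_ds are assumed pre-tokenised with this tokenizer.
        tok = AutoTokenizer.from_pretrained(CHECKPOINTS[name])
        model = AutoModelForSequenceClassification.from_pretrained(
            CHECKPOINTS[name], num_labels=num_labels)
        args = TrainingArguments(output_dir=f"out/{name}",
                                 per_device_train_batch_size=16,
                                 num_train_epochs=3)
        trainer = Trainer(model=model, args=args, tokenizer=tok,
                          train_dataset=train_ds, eval_dataset=eval_ds)
        trainer.train()
        return trainer.evaluate()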
The 2020 ALTA shared task is the 11th instance of a series of shared tasks organised by ALTA since 2010. The task is to classify texts posted in social media according to the human judgements expressed in them. The data used for this task is a subset of SemEval 2018 AIT DISC, which has been annotated by domain experts for this task. In this paper we introduce the task, describe the data and present the results of the participating systems.
We present an overview of the 2019 ALTA shared task. This is the 10th of the series of shared tasks organised by ALTA since 2010. The task was to detect the target of sarcastic comments posted on social media. We introduce the task, describe the data and present the results of baselines and participants. This year's shared task was particularly challenging and no participating systems improved on the results of our baseline.
We present an overview of the 2018 ALTA shared task. This is the 9th of the series of shared tasks organised by ALTA since 2010. The task was to classify Australian patents into the sections defined by the International Patent Classification (IPC), using data made available by IP Australia. We introduce the task, describe the data and present the results of the participating teams. Some of the participating teams outperformed the state of the art.
This paper describes Macquarie University’s contribution to the BioASQ Challenge (BioASQ 6b, Phase B). We focused on the extraction of the ideal answers, and the task was approached as an instance of query-based multi-document summarisation. In particular, this paper focuses on the experiments related to the deep learning and reinforcement learning approaches used in the submitted runs. The best run used a deep learning model under a regression-based framework. The deep learning architecture used features derived from the output of LSTM chains on word embeddings, plus features based on similarity with the query, and sentence position. The reinforcement learning approach was a proof-of-concept prototype that trained a global policy using REINFORCE. The global policy was implemented as a neural network that used tf.idf features encoding the candidate sentence, question, and context.
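The REINFORCE prototype can be pictured as a Bernoulli include/skip policy over candidate sentences, updated by the policy gradient scaled with a summary-level reward. The sketch below is a simplified numpy rendering under our own assumptions, not the submitted system:

    import numpy as np

    rng = np.random.default_rng(0)

    def reinforce_step(w, features, reward_fn, lr=0.1):
        # One REINFORCE update for a Bernoulli include/skip policy.
        # features: (n_sentences, d) tf.idf-style vectors encoding the
        # candidate sentence, question, and context.
        logits = features @ w
        probs = 1 / (1 + np.exp(-logits))
        actions = rng.binomial(1, probs)          # sample a summary
        reward = reward_fn(actions)               # e.g. ROUGE vs. ideal answer
        grad = features.T @ (actions - probs)     # grad of log-likelihood
        return w + lr * reward * grad

    # Toy usage with a stub reward (a real system would score with ROUGE).
    w = np.zeros(5)
    feats = rng.normal(size=(8, 5))
    w = reinforce_step(w, feats, reward_fn=lambda a: a.mean())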
The automation of text summarisation of biomedical publications is a pressing need due to the plethora of information available online. This paper explores the impact of several supervised machine learning approaches for extracting multi-document summaries for given queries. In particular, we compare classification and regression approaches for query-based extractive summarisation using data provided by the BioASQ Challenge. We tackle the problem of annotating sentences for training classification systems and show that a simple annotation approach outperforms regression-based summarisation.
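The simple annotation approach can be pictured as marking as positive the candidate sentences most similar to the ideal answer. The token-overlap F1 and top-k rule below are our stand-ins for the paper's exact annotation scheme:

    def label_sentences(sentences, ideal_answer, top_k=3):
        # Mark as positive the k candidate sentences with the highest
        # token-overlap F1 against the ideal answer.
        def f1(cand, ref):
            c, r = set(cand.lower().split()), set(ref.lower().split())
            overlap = len(c & r)
            if not c or not r or overlap == 0:
                return 0.0
            p, rec = overlap / len(c), overlap / len(r)
            return 2 * p * rec / (p + rec)

        scores = [f1(s, ideal_answer) for s in sentences]
        ranked = sorted(range(len(sentences)), key=scores.__getitem__,
                        reverse=True)
        positives = set(ranked[:top_k])
        return [int(i in positives) for i in range(len(sentences))]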
Macquarie University’s contribution to the BioASQ challenge (Task 5b Phase B) focused on the use of query-based extractive summarisation techniques for the generation of the ideal answers. Four runs were submitted, with approaches ranging from a trivial system that selected the first n snippets, to the use of deep learning approaches under a regression framework. Our experiments and the ROUGE results of the five test batches of BioASQ indicate surprisingly good results for the trivial approach. Overall, most of our runs on the first three test batches achieved the best ROUGE-SU4 results in the challenge.
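The trivial baseline is easy to state exactly: return the first n retrieved snippets as the ideal answer. A sketch (the function name and the value of n are illustrative):

    def first_n_baseline(snippets, n=3):
        # Trivial run: the ideal answer is the first n retrieved snippets,
        # concatenated in their original order.
        return " ".join(snippets[:n])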
Semi-supervised clustering is an attractive alternative to traditional (unsupervised) clustering in targeted applications. By using the information of a small annotated dataset, semi-supervised clustering can produce clusters that are customised to the application domain. In this paper, we present a semi-supervised clustering technique based on a multi-objective evolutionary algorithm (NSGA-II-clus). We apply this technique to the task of clustering medical publications for Evidence Based Medicine (EBM) and observe an improvement of the results against unsupervised and other semi-supervised clustering techniques.
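A multi-objective formulation needs at least one unsupervised and one supervised objective for NSGA-II to trade off. The pair below is our simplified reading of the setup, not the paper's exact objectives:

    import numpy as np
    from sklearn.metrics import adjusted_rand_score

    def objectives(X, assignment, labelled_idx, labelled_y):
        # Two objectives to minimise for one candidate clustering, where
        # X is an (n, d) array and assignment an (n,) array of cluster ids:
        # 1. cohesion: total squared distance of points to their cluster mean;
        # 2. supervision: disagreement with the small annotated subset,
        #    measured as 1 - Adjusted Rand Index.
        cohesion = 0.0
        for k in np.unique(assignment):
            members = X[assignment == k]
            cohesion += ((members - members.mean(axis=0)) ** 2).sum()
        supervision = 1.0 - adjusted_rand_score(labelled_y,
                                                assignment[labelled_idx])
        return cohesion, supervision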