Owen Cook


2026

This paper presents our findings for SemEval-2026 Task 9. We submit to all three subtasks using an LLM-as-an-annotator strategy, simulating the data annotation process with large language models. We created 30 LLM annotators using persona injection (also known as sociodemographic prompting) and experimented with various annotation aggregation methods, including Dawid-Skene and MACE. To further increase the variability in annotator responses, we used the hatefulness detection task as proxy for identifying polarisation. Our findings indicate that this reframing of the problem is effective for the binary classification of polarisation, but is less effective for finer-grained polarisation detection. For subtasks 2 and 3, majority voting yielded the best overall performance. While our unsupervised approach does not rank as highly as supervised ones, this work provides insight into the utility of persona-based prompting and the issue of LLM annotators exhibiting high intra-model agreement.

2025

Misinformation spreads rapidly on social media, confusing the truth and targeting potentially vulnerable people. To effectively mitigate the negative impact of misinformation, it must first be accurately detected before applying a mitigation strategy, such as X’s community notes, which is currently a manual process. This study takes a knowledge-based approach to misinformation detection, modelling the problem similarly to one of natural language inference. The EffiARA annotation framework is introduced, aiming to utilise inter- and intra-annotator agreement to understand the reliability of each annotator and influence the training of large language models for classification based on annotator reliability. In assessing the EffiARA annotation framework, the Russo-Ukrainian Conflict Knowledge-Based Misinformation Classification Dataset (RUC-MCD) was developed and made publicly available. This study finds that sample weighting using annotator reliability performs the best, utilising both inter- and intra-annotator agreement and soft label training. The highest classification performance achieved using Llama-3.2-1B was a macro-F1 of 0.757 and 0.740 using TwHIN-BERT-large.
Data annotation is an essential component of the machine learning pipeline; it is also a costly and time-consuming process. With the introduction of transformer-based models, annotation at the document level is increasingly popular; however, there is no standard framework for structuring such tasks. The EffiARA annotation framework is, to our knowledge, the first project to support the whole annotation pipeline, from understanding the resources required for an annotation task to compiling the annotated dataset and gaining insights into the reliability of individual annotators as well as the dataset as a whole. The framework’s efficacy is supported by two previous studies: one improving classification performance through annotator-reliability-based soft-label aggregation and sample weighting, and the other increasing the overall agreement among annotators through removing identifying and replacing an unreliable annotator. This work introduces the EffiARA Python package and its accompanying webtool, which provides an accessible graphical user interface for the system. We open-source the EffiARA Python package at https://github.com/MiniEggz/EffiARA and the webtool is publicly accessible at https://effiara.gate.ac.uk.