Lewis N. Watson

2026

Frame2KG: A Benchmark and Evaluation Toolkit for Interpretable Frame-to-Graph Generation
Lewis N. Watson | Carl Strathearn | Kenny Mitchell | Yanchao Yu
Proceedings of the Fifteenth Language Resources and Evaluation Conference

Interpretable frame-to-knowledge-graph (Frame2KG) generation enables structured visual scene representation while supporting on-device inference to enhance privacy, improve interpretability, and minimise compute. We introduce Frame2KG-YC2, a synthetic, reproducible dataset derived from YouCook2 that pairs keyframes with schema-valid JSON knowledge graphs containing typed, spatially grounded entities and semantic predicates, alongside faithful textual paraphrases. Using this corpus, we fine-tune Qwen2.5-VL models (3B and 7B) with parameter-efficient LoRA adapters on attention layers (QKVO), with and without GateProj/Up/Down MLP projections. For evaluation and benchmarking, we propose a deterministic toolkit featuring two-stage node matching (an IoU gate followed by Hungarian assignment on blended spatial-semantic similarity) and comprehensive metrics spanning node/edge precision-recall-F1, matched-pair IoU, and structural validity. On a held-out test set, our models achieve Node F1μ up to 0.621 and Edge F1μ up to 0.208, with a mean matched IoU of ≈0.61 and >98% schema conformity. We show that MLP gating consistently improves predicate accuracy and spatial grounding, while post-training quantisation maintains accuracy and improves deployability on edge hardware. We release the dataset, code, adapters, and evaluation toolkit to establish an open, interpretable baseline for future temporal and multi-view extensions.
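The two-stage node matching named in the abstract above can be illustrated with a minimal sketch. It is not the released toolkit: the function names, the 0.5 IoU gate, the equal blend weight `alpha`, and the exact-label similarity are all assumptions made for illustration. To stay dependency-free, the sketch also swaps the Hungarian algorithm for an exhaustive search over permutations, which finds the same optimal one-to-one assignment on small inputs; a real implementation would more likely call `scipy.optimize.linear_sum_assignment`.

```python
from itertools import permutations


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_sim(a, b):
    """Placeholder semantic similarity (assumption): 1.0 on exact label match."""
    return 1.0 if a.lower() == b.lower() else 0.0


def match_nodes(pred, gold, iou_gate=0.5, alpha=0.5):
    """Return (pred_index, gold_index) pairs from a gated optimal assignment."""
    # Stage 1: the IoU gate keeps only candidate pairs that overlap enough.
    score = {}
    for i, p in enumerate(pred):
        for j, g in enumerate(gold):
            overlap = iou(p["box"], g["box"])
            if overlap >= iou_gate:
                # Blend spatial and semantic similarity (weights are assumed).
                score[(i, j)] = alpha * overlap + (1 - alpha) * label_sim(p["label"], g["label"])

    # Stage 2: optimal one-to-one assignment over the gated pairs.
    # Brute force stands in for the Hungarian algorithm (small inputs only).
    n = max(len(pred), len(gold))
    best, best_total = [], -1.0
    for perm in permutations(range(n)):
        pairs = [(i, j) for i, j in enumerate(perm) if (i, j) in score]
        total = sum(score[pair] for pair in pairs)
        if total > best_total:
            best, best_total = pairs, total
    return best


def prf1(num_matched, num_pred, num_gold):
    """Precision, recall, and F1, treating each matched pair as a true positive."""
    p = num_matched / num_pred if num_pred else 0.0
    r = num_matched / num_gold if num_gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Hypothetical predicted and reference nodes for a single keyframe.
pred = [
    {"label": "pan", "box": (0, 0, 10, 10)},
    {"label": "knife", "box": (20, 20, 30, 30)},
    {"label": "bowl", "box": (50, 50, 60, 60)},
]
gold = [
    {"label": "pan", "box": (1, 1, 10, 10)},
    {"label": "knife", "box": (20, 20, 30, 32)},
]

matches = match_nodes(pred, gold)  # -> [(0, 0), (1, 1)]
precision, recall, f1 = prf1(len(matches), len(pred), len(gold))
```

Summing matched, predicted, and reference counts over all test frames before calling `prf1` would give a micro-averaged score, which is one plausible reading of the F1μ notation.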

PAIR: A Pilot Dataset for Dual Perspective-based Video-Grounded Dialogue and Reconciliation
Lewis N. Watson | Carl Strathearn | Kenny Mitchell | Yanchao Yu
Proceedings of the Fifteenth Language Resources and Evaluation Conference

Collaborative dialogue in multi-agent settings often requires interlocutors to integrate partially overlapping perceptual information in order to construct a shared representation of a dynamic environment. We introduce PAIR, a pilot conversational corpus designed to examine how humans coordinate under systematic perceptual asymmetry. The dataset comprises 15 dialogues in which participants observed the same activity from complementary egocentric and exocentric video perspectives and engaged in open-ended discussion to produce a joint account. All transcripts were manually verified and annotated with 42 dialogue act categories, enabling fine-grained analysis of interactional structure. Beyond descriptive statistics, PAIR supports examination of measurable conversational configurations, including turn distribution, participation symmetry, and dialogue act composition, which together provide structural indicators of how perspective integration unfolds in dialogue. Although intentionally lightweight, PAIR is positioned as a controlled benchmark for analysing collaborative dialogue mechanisms rather than a large-scale training resource. The corpus supports dialogue act classification, video-grounded dialogue modelling, and investigation of multi-agent reasoning under distributed perceptual access. By coupling dual-perspective grounding with explicit interactional annotation, PAIR offers a compact testbed for studying reconciliation dynamics in task-oriented dialogue.

2024

ReproHum #0712-01: Reproducing Human Evaluation of Meaning Preservation in Paraphrase Generation
Lewis N. Watson | Dimitra Gkatzia
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024

Reproducibility is a cornerstone of scientific research, ensuring the reliability and generalisability of findings. The ReproNLP Shared Task on Reproducibility of Evaluations in NLP aims to assess the reproducibility of human evaluation studies. This paper presents a reproduction study of the human evaluation experiment presented in “Hierarchical Sketch Induction for Paraphrase Generation” by Hosking et al. (2022). The original study employed a human evaluation on Amazon Mechanical Turk, assessing the quality of paraphrases generated by their proposed model using three criteria: meaning preservation, fluency, and dissimilarity. In our reproduction study, we focus on the meaning preservation criterion and utilise the Prolific platform for participant recruitment, following the ReproNLP challenge’s common approach to reproduction. We discuss the methodology, results, and implications of our reproduction study, comparing them to the original findings. Our findings contribute to the understanding of reproducibility in NLP research and highlight the potential impact of platform changes and evaluation criteria on the reproducibility of human evaluation studies.

Exploring the impact of data representation on neural data-to-text generation
David M. Howcroft | Lewis N. Watson | Olesia Nedopas | Dimitra Gkatzia
Proceedings of the 17th International Natural Language Generation Conference

A relatively under-explored area in research on neural natural language generation is the impact of the data representation on text quality. Here we report experiments on two leading input representations for data-to-text generation: attribute-value pairs and Resource Description Framework (RDF) triples. Evaluating the performance of encoder-decoder seq2seq models as well as recent large language models (LLMs) with both automated metrics and human evaluation, we find that the input representation does not seem to have a large impact on the performance of either purpose-built seq2seq models or LLMs. Finally, we present an error analysis of the texts generated by the LLMs and provide some insights into where these models fail.
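To make the two input representations named in the abstract above concrete, here is a small sketch that renders one record both ways. The record, the bracketed `attribute[value]` linearisation, and the choice of the `name` field as the triple subject are illustrative assumptions, not the paper's actual data or preprocessing.

```python
def as_attribute_value_pairs(record):
    """Linearise a flat record as 'attribute[value]' pairs (assumed format)."""
    return ", ".join(f"{attr}[{value}]" for attr, value in record.items())


def as_rdf_triples(record, subject_key="name"):
    """Express the same record as (subject, predicate, object) triples."""
    subject = record[subject_key]
    return [
        (subject, attr, value)
        for attr, value in record.items()
        if attr != subject_key
    ]


# Hypothetical record used only for illustration.
record = {"name": "The Eagle", "eatType": "coffee shop", "area": "riverside"}

pairs = as_attribute_value_pairs(record)
# -> "name[The Eagle], eatType[coffee shop], area[riverside]"

triples = as_rdf_triples(record)
# -> [("The Eagle", "eatType", "coffee shop"), ("The Eagle", "area", "riverside")]
```

Both forms carry the same facts. The difference is that the triple form repeats the subject in every statement, while the attribute-value form states it once as an ordinary field.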
