Proceedings of the 6th Workshop on Computational Approaches to Discourse, Context and Document-Level Inferences (CODI 2025)
Michael Strube | Chloe Braud | Christian Hardmeier | Junyi Jessy Li | Sharid Loaiciga | Amir Zeldes | Chuyuan Li
Long Context Benchmark for the Russian Language
Igor Churin | Murat Apishev | Maria Tikhonova | Denis Shevelev | Aydar Bulatov | Yuri Kuratov | Sergei Averkiev | Alena Fenogenova
Recent progress in Natural Language Processing (NLP) has driven the creation of Large Language Models (LLMs) capable of tackling a vast range of tasks. A critical property of these models is their ability to handle large documents and process long token sequences, which has fostered the need for a robust evaluation methodology for long-text scenarios. To meet this requirement in the context of the Russian language, we present our benchmark consisting of 18 datasets designed to assess LLM performance in tasks such as information retrieval, knowledge extraction, machine reading, question answering, and reasoning. These datasets are categorized into four levels of complexity, enabling model evaluation across context lengths up to 128k tokens. To facilitate further research, we provide open-source datasets, a codebase, and a public leaderboard associated with the benchmark.
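As a rough illustration of the length-stratified evaluation such a benchmark implies, the sketch below buckets examples by context length and scores each bucket separately. The bucket boundaries are illustrative assumptions, not the benchmark's actual four complexity levels.

```python
from collections import defaultdict

# Hypothetical length buckets; the benchmark's four levels are not
# specified here, so these boundaries are invented for illustration.
BUCKETS = [(0, 4_000), (4_000, 16_000), (16_000, 64_000), (64_000, 128_000)]

def bucket_of(n_tokens: int) -> int:
    """Map a context length (in tokens) to a bucket index."""
    for i, (lo, hi) in enumerate(BUCKETS):
        if lo <= n_tokens < hi:
            return i
    return len(BUCKETS) - 1

def per_bucket_accuracy(examples):
    """examples: iterable of (n_tokens, gold, prediction) triples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for n_tokens, gold, pred in examples:
        b = bucket_of(n_tokens)
        totals[b] += 1
        hits[b] += int(gold == pred)
    return {b: hits[b] / totals[b] for b in sorted(totals)}

print(per_bucket_accuracy([(2_000, "A", "A"), (90_000, "B", "C")]))
```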
Unpacking Ambiguity: The Interaction of Polysemous Discourse Markers and Non-DM Signals
Jingni Wu | Amir Zeldes
Discourse markers (DMs) like ‘but’ or ‘then’ are crucial for creating coherence in discourse, yet they are often replaced by or co-occur with non-DMs (‘in the morning’ can mean the same as ‘then’), and both can be ambiguous (‘since’ can refer to time or cause). The interaction mechanism between such signals remains unclear but pivotal for their disambiguation. In this paper we investigate the relationship between DM polysemy and co-occurrence of non-DM signals in English, as well as the influence of genre on these patterns. Using the framework of eRST, we propose a graded definition of DM polysemy, and conduct correlation and regression analyses to examine whether polysemous DMs are accompanied by more numerous and diverse non-DM signals. Our findings reveal that while polysemous DMs do co-occur with more diverse non-DMs, the total number of co-occurring signals does not necessarily increase. Moreover, genre plays a significant role in shaping DM-signal interactions.
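One way to make a "graded definition of DM polysemy" concrete is sketched below: polysemy as the entropy of a marker's relation-sense distribution, correlated against the diversity of co-occurring non-DM signals. The entropy-based reading and all counts are illustrative assumptions, not the paper's definitions or data.

```python
import math
from collections import Counter
from scipy.stats import spearmanr

def sense_entropy(relation_counts: Counter) -> float:
    """Graded polysemy as Shannon entropy (bits) of a DM's relation-sense
    distribution -- one plausible formalization, not necessarily the paper's."""
    total = sum(relation_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in relation_counts.values())

# Toy sense counts per DM (invented numbers).
dms = {
    "but":   Counter({"contrast": 80, "concession": 40}),
    "since": Counter({"cause": 30, "temporal": 25}),
    "then":  Counter({"sequence": 50}),
}
# Number of distinct non-DM signal types co-occurring with each DM (toy).
signal_diversity = {"but": 12, "since": 9, "then": 4}

polysemy = [sense_entropy(c) for c in dms.values()]
diversity = [signal_diversity[dm] for dm in dms]
rho, p = spearmanr(polysemy, diversity)
print(f"Spearman rho={rho:.2f} (p={p:.2f})")
```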
Enhancing the Automatic Classification of Metadiscourse in Low-Proficiency Learners’ Spoken and Written English Texts Using XLNet
Wenwen Guan | Marijn Alta | Jelke Bloem
This study aims to enhance the automatic identification and classification of metadiscourse markers in English texts, evaluating various large language models for this purpose. Metadiscourse is a rhetorical strategy commonly used in both written and spoken language to guide addressees through discourse. Due to its linguistic complexity and context dependence, automated metadiscourse classification is challenging. Hypothesizing that LLMs may handle such complex tasks more effectively than supervised machine learning approaches, we tune and evaluate seven encoder language models on the task, using a dataset totalling 575,541 tokens annotated with 24 labels. The results show a clear improvement over supervised machine learning approaches as well as over an untuned Llama3.3-70B-Instruct baseline, with XLNet-large achieving an accuracy of 0.91 and an F1-score of 0.93. However, four less frequent categories record F-scores below 0.5, highlighting the need for more balanced data representation.
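For readers unfamiliar with the setup, a minimal sketch of fine-tuning XLNet with a token-classification head over 24 labels follows (Hugging Face Transformers assumed; the dummy zero labels stand in for the corpus's real span annotations, which would be aligned to subword tokens).

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Minimal sketch, not the paper's pipeline: XLNet with a token-level
# classification head over 24 metadiscourse labels (label inventory assumed).
MODEL = "xlnet-large-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForTokenClassification.from_pretrained(MODEL, num_labels=24)

sent = "In conclusion, we believe the results speak for themselves."
enc = tokenizer(sent, return_tensors="pt")

# Dummy gold labels (one id per subword token) just to show a training step.
labels = torch.zeros_like(enc["input_ids"])

out = model(**enc, labels=labels)
out.loss.backward()            # one gradient step's worth of signal
pred = out.logits.argmax(-1)   # per-token label ids at inference time
```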
Entity Tracking in Small Language Models: An Attention-Based Study of Parameter-Efficient Fine-Tuning
Sungho Jeon | Michael Strube
The ability to track entities is fundamental for language understanding, yet the internal mechanisms governing this capability in Small Language Models (SLMs) are poorly understood. Previous studies often rely on indirect probing or complex interpretability methods, leaving a gap for lightweight diagnostics that connect model behavior to performance. To bridge this gap, we introduce a framework to analyze entity tracking by measuring the attention flow between entity and non-entity tokens within SLMs. We apply this to analyze models both before and after Parameter-Efficient Fine-Tuning (PEFT). Our analysis reveals two key findings. First, SLMs’ attentional strategies vary significantly with text type, but entities consistently receive a high degree of focus. Second, we show that PEFT – specifically QLoRA – dramatically improves classification performance on entity-centric tasks by increasing the model’s attentional focus on entity-related tokens. Our work provides direct evidence for how PEFT can refine a model’s internal mechanisms and establishes attention analysis as a valuable, lightweight diagnostic tool for interpreting and improving SLMs.
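The attention-flow diagnostic lends itself to a compact sketch: what fraction of total attention mass lands on entity tokens, averaged over layers and heads? The model choice, toy entity list, and aggregation below are illustrative assumptions, not the authors' exact procedure.

```python
import torch
from transformers import AutoTokenizer, AutoModel

MODEL = "distilbert-base-uncased"   # stand-in small model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, output_attentions=True)

text = "Alice met Bob in Paris before the conference."
entity_words = {"alice", "bob", "paris"}   # toy entity annotation

enc = tokenizer(text, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
is_entity = torch.tensor([t.strip("#").lower() in entity_words for t in tokens])

with torch.no_grad():
    attn = model(**enc).attentions        # tuple: layers x (1, heads, q, k)

# Average over layers and heads, then sum attention received per key token.
A = torch.stack(attn).mean(dim=(0, 2))[0]          # (query, key)
received = A.sum(dim=0)                             # attention mass per token
share = received[is_entity].sum() / received.sum()
print(f"share of attention on entity tokens: {share:.2%}")
```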
Stance Detection on Nigerian 2023 Election Tweets Using BERT: A Low-Resource Transformer-Based Approach
Mahmoud Ahmad | Habeebah Kakudi
This paper investigates stance detection on Nigerian 2023 election tweets by comparing transformer-based and classical machine learning models. A balanced dataset of 2,100 annotated tweets was constructed, and BERT-base-uncased was fine-tuned to classify stances into Favor, Neutral, and Against. The model achieved 98.1% accuracy on an 80/20 split and an F1-score of 96.9% under 5-fold cross-validation. Baseline models such as Naïve Bayes, Logistic Regression, Random Forest, and SVM were also evaluated, with SVM achieving 97.6% F1. While classical methods remain competitive on curated datasets, BERT proved more robust in handling noisy, sarcastic, and ambiguous text, making it better suited for real-world applications in low-resource African NLP contexts.
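A minimal sketch of the kind of classical baseline the paper compares against: TF-IDF features with a linear SVM under stratified cross-validation. The tweets below are invented stand-ins for the 2,100-tweet dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

tweets = [
    "Obi is the leader Nigeria needs",        # favor
    "Proud to support this candidate",        # favor
    "She has my vote, no question",           # favor
    "This administration has failed us",      # against
    "Vote him out, enough is enough",         # against
    "His manifesto is empty promises",        # against
    "Polling stations open at 8am",           # neutral
    "Results are expected next week",         # neutral
    "INEC released the candidate list",       # neutral
]
labels = ["favor"] * 3 + ["against"] * 3 + ["neutral"] * 3

svm = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
scores = cross_val_score(svm, tweets, labels, cv=3, scoring="f1_macro")
print(f"macro-F1 across folds: {scores.mean():.3f}")
```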
Code-switching in Context: Investigating the Role of Discourse Topic in Bilingual Speech Production
Debasmita Bhattacharya | Anxin Yi | Siying Ding | Julia Hirschberg
Code-switching (CSW) in speech is motivated by conversational factors across levels of linguistic analysis. While we know much about why speakers code-switch, there remains great scope for exploring how CSW occurs in speech, particularly within the discourse-level linguistic context. We build on prior work by asking: how are patterns of CSW influenced by different conversational contexts spanning Academic, Cultural, Personal, and Professional discourse topics? To answer this, we annotate a Mandarin-English spontaneous speech corpus, and analyze its discourse topics alongside various aspects of CSW production. We show that discourse topics interact significantly with utterance-level CSW, resulting in distinctive patterns of CSW presence, richness, language direction, and syntax that are uniquely associated with different contexts. Our work is the first to take such a context-sensitive approach to studying CSW, contributing to a broader understanding of the discourse topics that motivate speakers to code-switch in diverse ways.
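A natural first test of the topic/CSW interaction the abstract describes is a contingency-table analysis of CSW presence by discourse topic; the sketch below uses invented counts, not the paper's corpus statistics.

```python
from scipy.stats import chi2_contingency

# Toy contingency table: utterances with vs. without code-switching
# per discourse topic (counts invented for illustration).
#             CSW   no-CSW
table = [
    [120,  380],   # Academic
    [ 90,  210],   # Cultural
    [150,  450],   # Personal
    [ 60,  340],   # Professional
]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.1f}, dof={dof}, p={p:.3g}")
```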
“Otherwise” in Context: Exploring Discourse Functions with Language Models
Guifu Liu | Bonnie Webber | Hannah Rohde
Discourse adverbials are key features of discourse coherence, but their function is often ambiguous. In this work, we investigate how the discourse function of otherwise varies in different contexts. We revise the function set in Rohde et al. (2018b) to account for a new meaning we have encountered. In turn, we create the “otherwise” corpus, a dataset of naturally occurring passages annotated for discourse functions, and conduct a corpus study to identify lexical signals that make a function available. We define continuation acceptability, a surprisal-based metric that probes language models for what they take the function of otherwise to be in a given context. Our experiments show that one can improve function inference by focusing solely on tokens up to and including the head verb of the continuation (i.e., the otherwise clause) that have the most varied surprisal across function-disambiguating discourse markers. Lastly, we observe that some of these tokens confirm lexical signals found in our earlier corpus study, which provides promising evidence to motivate future pragmatic studies of language models.
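A surprisal probe of this kind can be sketched in a few lines: score each continuation token's surprisal under GPT-2 given the preceding context. The passage is invented, and the paper's exact continuation-acceptability computation may aggregate these values differently.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Minimal sketch of token-level surprisal for an otherwise-clause
# continuation; illustrative, not the paper's metric.
tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

context = "Wear a coat. Otherwise,"
continuation = " you will be cold."

ids = tok(context + continuation, return_tensors="pt").input_ids
n_ctx = len(tok(context).input_ids)
with torch.no_grad():
    logprobs = torch.log_softmax(model(ids).logits, dim=-1)

# Surprisal (in bits) of each continuation token given what precedes it.
ln2 = torch.log(torch.tensor(2.0))
for pos in range(n_ctx, ids.shape[1]):
    s = -logprobs[0, pos - 1, ids[0, pos]] / ln2
    print(f"{tok.decode(ids[0, pos])!r}: {s.item():.2f} bits")
```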
On the Role of Context for Discourse Relation Classification in Scientific Writing
Stephen Wan | Wei Liu | Michael Strube
With the increasing use of generative Artificial Intelligence (AI) methods to support science workflows, we are interested in the use of discourse-level information to find supporting evidence for AI-generated scientific claims. A first step towards this objective is to examine the task of inferring discourse structure in scientific writing. In this work, we present a preliminary investigation of pretrained language model (PLM) and Large Language Model (LLM) approaches for Discourse Relation Classification (DRC), focusing on scientific publications, an under-studied genre for this task. We examine how context can help with the DRC task, with our experiments showing that context, as defined by discourse structure, is generally helpful. We also present an analysis of which scientific discourse relation types might benefit most from context.
Zero-Shot Belief: A Hard Problem for LLMs
John Murzaku | Owen Rambow
We present two LLM-based approaches to zero-shot source-and-target belief prediction on FactBank: a unified system that identifies events, sources, and belief labels in a single pass, and a hybrid approach that uses a fine-tuned DeBERTa tagger for event detection. We show that multiple open-source, closed-source, and reasoning-based LLMs struggle with the task. We then argue that careful source normalization is crucial and provide a few-shot normalization method that improves alignment between predicted and gold-standard sources. Using the hybrid approach, we achieve new state-of-the-art results on FactBank and offer a detailed error analysis. Our approach is then tested on the Italian belief corpus ModaFact. Although we fall short of prior fine-tuned baselines, our zero-shot methods substantially narrow the gap, emphasizing the promise of hybrid pipelines for belief prediction beyond English. We conclude that integrated event tagging, careful prompting, and robust source normalization jointly enable effective zero-shot belief models.
Probing the Limits of Multilingual Language Understanding: Low-Resource Language Proverbs as LLM Benchmark for AI Wisdom
Surendrabikram Thapa | Kritesh Rauniyar | Hariram Veeramani | Surabhi Adhikari | Imran Razzak | Usman Naseem
Understanding and interpreting culturally specific language remains a significant challenge for multilingual natural language processing (NLP) systems, particularly for less-resourced languages. To address this problem, this paper introduces PRONE, a novel dataset of 2,830 Nepali proverbs, and evaluates the performance of various language models (LMs) on two tasks: (i) identifying the correct meaning of a proverb from multiple choices, and (ii) categorizing proverbs into predefined thematic categories. The models, both open-source and proprietary, were tested in zero-shot and few-shot settings with prompts in English and Nepali. While models like GPT-4o demonstrated promising results and achieved the highest performance among LMs, they still fall short of human-level accuracy in understanding and categorizing culturally nuanced content, highlighting the need for more inclusive NLP.
Measuring Sexism in US Elections: A Comparative Analysis of X Discourse from 2020 to 2024
Anna Fuchs | Elisa Noltenius | Caroline Weinzierl | Bolei Ma | Anna-Carolina Haensch
Sexism continues to influence political campaigns, affecting public perceptions of candidates in a variety of ways. This paper examines sexist content on the social media platform X during the 2020 and 2024 US election campaigns, focusing on both male and female candidates. Two approaches, single-step and two-step categorization, were employed to classify tweets into different sexism categories. Comparing these approaches against a human-annotated subsample, we found that the single-step approach outperformed the two-step approach. Our analysis further reveals that sexist content increased between the 2020 and 2024 elections, and that female candidates faced a greater volume of sexist tweets than their male counterparts. Compared to human annotations, GPT-4 struggled to detect sexism, reaching an accuracy of about 51%. Given both the low agreement among the human annotators and the model's modest accuracy, our study emphasizes the challenges in detecting complex social phenomena such as sexism.
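The single-step versus two-step distinction can be made concrete with a small sketch; `ask_llm` is a hypothetical stand-in for any chat-completion call, and the category names are illustrative, not the study's actual taxonomy.

```python
# Hypothetical sketch of the two prompting regimes compared in the paper.
CATEGORIES = ["hostile", "benevolent", "none"]   # assumed labels

def classify_single_step(ask_llm, tweet: str) -> str:
    """One pass: detect and categorize sexism at once."""
    return ask_llm(
        f"Classify the tweet as one of {CATEGORIES} and answer with the "
        f"label only.\nTweet: {tweet}"
    )

def classify_two_step(ask_llm, tweet: str) -> str:
    """First detect sexism, then categorize only the positive cases."""
    if ask_llm(f"Is this tweet sexist? Answer yes or no.\nTweet: {tweet}") == "no":
        return "none"
    return ask_llm(
        f"Which kind of sexism, {CATEGORIES[:-1]}? Answer with the label "
        f"only.\nTweet: {tweet}"
    )
```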
Discourse Relation Recognition with Language Models Under Different Data Availability
Shuhaib Mehri | Chuyuan Li | Giuseppe Carenini
Large Language Models (LLMs) have demonstrated remarkable performance across various NLP tasks, yet they continue to face challenges in discourse relation recognition (DRR). Current state-of-the-art methods for DRR primarily rely on smaller pre-trained language models (PLMs). In this study, we conduct a comprehensive analysis of different approaches using both PLMs and LLMs, evaluating their effectiveness for DRR at multiple granularities and under different data availability settings. Our findings indicate that no single approach consistently outperforms the others, and we offer a general comparison framework to guide the selection of the most appropriate model based on specific DRR requirements and data conditions.
EmbiText: Embracing Ambiguity by Annotation, Recognition and Generation of Pronominal Reference with Event-Entity Ambiguity
Amna Sheikh | Christian Hardmeier
Consider the example “The bird sang the nursery rhyme beautifully. It made everyone in the room smile.” The pronoun ‘it’ here refers either to the bird or to the event of singing. The example is inherently ambiguous: it cannot be meaningfully disambiguated as an event or entity reference, as both readings yield the same text meaning. This study introduces EmbiText, a new dataset that preserves such ambiguity by modelling, rather than resolving, pronominal reference that is ambiguous between an entity and an event; oftentimes, ambiguity does not need to be resolved but should be modelled carefully. Furthermore, the study explores the capacity of LLMs (Llama, Mistral, Gemini, Claude) to embrace ambiguity by generating texts that exhibit referential ambiguity via an in-context learning approach. To evaluate the dataset, RoBERTa was fine-tuned on it to model ambiguity while simultaneously distinguishing between entity and event references. Results demonstrate EmbiText's capacity to advance ongoing NLP research by modelling linguistic ambiguity in computational environments instead of fully disambiguating it, thereby retaining diverse interpretations where resolution may alter meaning.
Human and LLM-based Assessment of Teaching Acts in Expert-led Explanatory Dialogues
Aliki Anagnostopoulou | Nils Feldhus | Yi-Sheng Hsu | Milad Alshomary | Henning Wachsmuth | Daniel Sonntag
Understanding the strategies that make expert-led explanations effective is a core challenge in didactics and a key goal for explainable AI. To study this computationally, we introduce ReWIRED, a large corpus of explanatory dialogues annotated by education experts with fine-grained, span-level teaching acts across five levels of explainee knowledge. We use this resource to assess the capabilities of modern language models, finding that while few-shot LLMs struggle to label these acts, fine-tuning is a highly effective methodology. Moving beyond structural annotation, we propose and validate a suite of didactic quality metrics. We demonstrate that a prompt-based evaluation using an LLM as a “judge” is required to capture how the functional quality of an explanation aligns with the learner’s expertise – a nuance missed by simpler static metrics. Together, our dataset, modeling insights, and evaluation framework provide a comprehensive methodology to bridge pedagogical principles with computational discourse analysis.
Where Frameworks (Dis)agree: A Study of Discourse Segmentation
Maciej Ogrodniczuk | Anna Latusek | Karolina Saputa | Alina Wróblewska | Daniel Ziembicki | Bartosz Żuk | Martyna Lewandowska | Adam Okrasiński | Paulina Rosalska | Anna Śliwicka | Aleksandra Tomaszewska | Sebastian Żurowski
This study addresses the fundamental task of discourse unit detection – the critical initial step in discourse parsing. We analyze how various discourse frameworks conceptualize and structure discourse units, with a focus on their underlying taxonomies and theoretical assumptions. While approaches to discourse segmentation vary considerably, the extent to which these conceptual divergences influence practical implementations remains insufficiently studied. To address this gap, we investigate similarities and differences in segmentation across several English datasets, segmented and annotated according to distinct discourse frameworks, using simple rule-based heuristics. We evaluate the effectiveness of the rules against gold-standard segmentation, while also checking variability and cross-framework generalizability. Additionally, we conduct a manual comparison of a sample of rule-based segmentation outputs against benchmark segmentation, identifying points of convergence and divergence. Our findings indicate that discourse frameworks align strongly at the level of segmentation: particular clauses consistently serve as the primary boundaries of discourse units. Discrepancies arise mainly in the treatment of other structures, such as adpositional phrases, appositions, interjections, and parenthesised text segments, which are inconsistently marked as separate discourse units across formalisms.
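To give a flavour of such heuristics, the sketch below opens a new discourse unit before each finite clausal dependent in a spaCy parse. The dependency labels chosen are an assumption, one of many plausible rule sets, and the `en_core_web_sm` model must be installed.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def segment(text: str) -> list[str]:
    """Open a new discourse unit before each finite clausal dependent."""
    doc = nlp(text)
    units, start = [], 0
    for token in doc:
        if token.pos_ == "VERB" and token.dep_ in {"advcl", "ccomp", "conj"}:
            left = token.left_edge.i       # first token of the clause
            if left > start:
                units.append(doc[start:left].text)
                start = left
    units.append(doc[start:].text)
    return units

print(segment("Although it rained, we hiked, and we enjoyed the views."))
```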
Bridging Discourse Treebanks with a Unified Rhetorical Structure Parser
Elena Chistova
We introduce UniRST, the first unified RST-style discourse parser capable of handling 18 treebanks in 11 languages without modifying their relation inventories. To overcome inventory incompatibilities, we propose and evaluate two training strategies: Multi-Head, which assigns a separate relation classification layer to each inventory, and Masked-Union, which enables shared-parameter training through selective label masking. We first benchmark mono-treebank parsing with a simple yet effective augmentation technique for low-resource settings. We then train a unified model and show that (1) the parameter-efficient Masked-Union approach is also the strongest, and (2) UniRST outperforms 16 of 18 mono-treebank baselines, demonstrating the advantages of single-model, multilingual, end-to-end discourse parsing across diverse resources.
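The Masked-Union strategy, as described, admits a compact sketch: one shared classifier over the union of all relation inventories, with logits of labels outside a treebank's inventory masked out for its examples. The inventories, labels, and sizes below are invented for illustration, not taken from the treebanks.

```python
import torch
import torch.nn as nn

# Toy union inventory and per-treebank label sets (invented).
UNION = ["Elaboration", "Cause", "Contrast", "Joint", "Background"]
INVENTORIES = {
    "rst-dt": {"Elaboration", "Cause", "Contrast", "Joint"},
    "gum":    {"Elaboration", "Cause", "Joint", "Background"},
}

classifier = nn.Linear(128, len(UNION))      # shared head, toy hidden size

def masked_logits(h: torch.Tensor, treebank: str) -> torch.Tensor:
    """Mask out labels missing from this treebank's inventory."""
    logits = classifier(h)
    mask = torch.tensor([lbl in INVENTORIES[treebank] for lbl in UNION])
    return logits.masked_fill(~mask, float("-inf"))

h = torch.randn(2, 128)                       # toy span representations
targets = torch.tensor([0, 4])                # in-inventory labels for "gum"
loss = nn.functional.cross_entropy(masked_logits(h, "gum"), targets)
```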
Corpus-Oriented Stance Target Extraction
Benjamin Steel | Derek Ruths
Understanding public discourse through the frame of stance detection requires effective extraction of the issues under discussion, or stance targets. Yet current approaches to stance target extraction are limited, focusing only on mapping a single document to a single stance target. We propose a broader view of the task, which we call corpus-oriented stance target extraction. This approach recognizes that documents have multiple stance targets, that those targets are hierarchical in nature, and that a document's stance targets should not be considered in isolation from other documents in a corpus. We develop a formalization and metrics for the task, propose a new method to address it, and show its improvement over previous methods using supervised and unsupervised metrics as well as human evaluation. Finally, we demonstrate its utility in a case study, showcasing its ability to reliably surface key issues of discussion in large-scale corpora.
Information-Theoretic and Prompt-Based Evaluation of Discourse Connective Edits in Instructional Text Revisions
Berfin Aktas | Michael Roth
We present a dataset of text revisions involving the deletion or replacement of discourse connectives. Manual annotation of a replacement subset reveals that only 19% of edits were judged either necessary or as cases where the original should have been left unchanged, with the rest appearing optional. Surprisal metrics derived from GPT-2 token probabilities and prompt-based predictions from GPT-4.1 correlate with these judgments, particularly in such clear cases.