Proceedings of the 2nd Joint Workshop on Computational Approaches to Discourse, Context and Document-Level Inferences and Computational Models of Reference, Anaphora and Coreference (CODI-CRAC 2026)

Chloé Braud, Christian Hardmeier, Maciej Ogrodniczuk, Sharid Loaiciga, Amir Zeldes, Michal Novák, Chuyuan Li, Michael Strube, Junyi Jessy Li (Editors)



Discourse presentation is the presentation, within a discourse, of speech, writing, or thought (SW&T) attributed to a discourse entity (such as a character in a narrative). Discourse presentations can be broadly divided into direct and indirect: direct presentation quotes the words or thoughts verbatim, whereas indirect presentation expresses the SW&T in the narrator’s or writer’s own words. Automatically detecting and categorizing discourse presentations supports discourse and narrative analysis and improves attribution for downstream NLP tasks, but detecting indirect discourse presentations remains challenging due to diverse surface forms and subtle perspective shifts. We study detection and categorization of discourse presentations on a corrected version of Semino & Short’s English Narrative SW&TP corpus. We cast the task as five-way clause classification: Direct Speech & Writing, Direct Thought, Indirect Speech & Writing, Indirect Thought, and Narrative (i.e., no discourse presentation). We compare four approaches: (1) a CNN; (2) a generative baseline (Claude Sonnet 4.6); (3) untuned BERT; and (4) fine-tuned BERT. The CNN baseline achieves 0.43 F1 and exhibits substantial confusion with the Narrative class. Claude achieves 0.71 F1 but performs unevenly across classes and fails to recover Indirect Thought. Untuned BERT achieves 0.81 F1 overall but struggles on indirect categories. Fine-tuned BERT yields strong performance (0.88 F1), with remaining errors concentrated in Indirect Speech & Writing (F1 = 0.60). We release our code and the corrected dataset to support reproducibility. To our knowledge, this is the first time computational approaches have been evaluated across the full range of SW&TP discourse presentation types.
The relations connecting propositions in discourse such as cause (A because B) or concession (A although B) are a subject of intense interest in Computational Linguistics and Pragmatics, but challenging to study and compare across languages. Recent progress in standardizing discourse relation inventories across datasets offers the potential to facilitate such studies, but is hindered by the complexity of relevant data and the lack of easily accessible interfaces to analyze it. In this paper we present DiscoExplorer, a new open source web interface, capable of running on local computers, which we use to make datasets from the DISRPT Shared Task on discourse relation classification publicly available, covering 16 different languages. We present the query language, search and visualization facilities for relations and signaling devices such as connectives, as well as some example studies.
Training Large Language Models (LLMs) relies predominantly on written, curated corpora, which may limit their reliability on spontaneous speech. Oral language exhibits real-time planning markers (filled pauses, repetitions, false starts, and vowel lengthenings) that modulate epistemic commitment. This pilot study investigates how such disfluencies affect the alignment between LLM confidence and a discourse-pragmatic uncertainty proxy in a Portuguese model (Llama-3.1-8B-Instruct). Using a benchmark of 344 turns from the Roda Viva corpus, we contrast faithful Conversation Analysis transcriptions with sanitized versions and combine binned divergence metrics (ECE, OE) with rank correlation and multivariate regression analyses. We find that model confidence is overwhelmingly driven by a surface feature, turn length ($\beta_{\text{std}} = +14.47$, $p < 0.001$), rather than by pragmatic markers of uncertainty ($\beta_{\text{oral}} = -3.09$, $\beta_{\text{hedges}} = -0.97$, both non-significant; $R^2 = 0.29$). After controlling for length, residual effects of disfluency markers align in the human-expected direction but are dwarfed by length bias. We argue that this surface-feature dominance subsumes the pragmatic blindness phenomenon and explains the substantial divergence observed via ECE (41.95) and OE (4.29) between faithful and sanitized conditions.
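To illustrate the kind of binned metric the abstract above refers to, here is a minimal sketch of the textbook expected calibration error (ECE). It is not the authors' implementation: the abstract uses ECE as a divergence between transcription conditions, so their variant may differ. The function name, the default of ten bins, and the toy inputs are assumptions made for illustration.

```python
def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Textbook ECE: bin-size-weighted mean of |accuracy - mean confidence|.

    confidences: floats in [0, 1]; outcomes: 1 for correct, 0 for incorrect.
    The equal-width binning scheme is an assumption, not taken from the paper.
    """
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one item")

    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, outcomes):
        # Map the confidence to a bin index; conf == 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for items in bins:
        if not items:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(h for _, h in items) / len(items)
        ece += (len(items) / n) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # Toy data (hypothetical): four turns, all scored with 0.9 confidence.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

A value of zero means that, within every bin, average confidence equals observed accuracy; larger values indicate a wider gap between the two.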
Recent work representing discourse relations such as "cause" or "concession" in the framework of eRST has connected hierarchical discourse parsing to explicit connectives, such as ’because’ or ’although’, bringing the framework closer to lexicalized shallow parsing in the tradition of PDTB. However, while PDTB postulates implicit, unexpressed connectives (i.e. an implied ’although’ etc.), no such devices are recognized in eRST, and consequently next to nothing is known about the relationship between PDTB-style implicit connectives and eRST-style discourse graphs. In this paper we propose and evaluate an algorithm to align eRST data, which already indicates explicit connectives, to implicit connective annotations following the PDTB guidelines. We also conduct the first evaluation of the relationship between hierarchical RST-style relations and PDTB implicit connectives.
In this paper, we present a descriptive corpus analysis of bridging anaphora across 16 genres of English, leveraging the multi-genre GUMBridge corpus for varieties of bridging anaphora. We begin our investigation by examining the distribution of bridging instances by sub-varieties and across genres, finding that spoken genres have fewer bridging instances than written ones. We then investigate the linguistic environments of bridging anaphora and their corresponding associative antecedents in the underlying data of the corpus, examining both categorical features (entity type, part of speech, syntactic dependency relations) and numeric features (mention length, cluster size, salience, and distance between the bridging anaphor and antecedent). We find bridging anaphora have a tendency to be shorter and are more often definite, and bridging antecedents show a tendency to be more salient than other entities. Finally, we analyze how several of the numeric features of bridging environments vary by genre, finding consistent patterns across genres for observed trends in the environments of bridging anaphora and antecedents.
Crowdsourced data for implicit discourse relation recognition (IDRR) has been shown to contain both plausible interpretations and noisy annotations. We present a case study of dataset cartography (Swayamdipta et al., 2020) on the IDRR-focused DiscoGeM corpus (Scholman et al., 2022). Our findings show that error identification via low confidence proves unreliable, as confidence is strongly affected by label rarity. However, high-confidence datapoints reveal a different use case: auditing the cue-rich regions of the dataset. Our lexical probe demonstrates an association between high-confidence items and (mostly temporal) intra-argument cue words. Dataset cartography can thus serve as a diagnostic of cue-driven easy-to-learn cases, which need to be balanced out to ensure the robustness of IDRR learning.
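For readers unfamiliar with dataset cartography, the two coordinates it assigns to each training item can be sketched as follows. In the original formulation, confidence is the mean probability the model gives the gold label across training epochs, and variability is the standard deviation of that probability. The function name and the toy probabilities below are illustrative assumptions and are not taken from the paper.

```python
from statistics import mean, pstdev


def cartography_coordinates(gold_probs):
    """Return (confidence, variability) for one training item.

    gold_probs: probability assigned to the gold label after each epoch.
    Confidence is the mean over epochs; variability is the (population)
    standard deviation over epochs.
    """
    if not gold_probs:
        raise ValueError("need at least one epoch")
    return mean(gold_probs), pstdev(gold_probs)


if __name__ == "__main__":
    # Hypothetical per-epoch gold-label probabilities for three items.
    items = {
        "easy_item": [0.90, 0.95, 0.97],
        "ambiguous_item": [0.20, 0.80, 0.40],
        "hard_item": [0.05, 0.10, 0.08],
    }
    for name, probs in items.items():
        confidence, variability = cartography_coordinates(probs)
        print(f"{name}: confidence={confidence:.2f} variability={variability:.2f}")
```

Items with high confidence and low variability are the "easy-to-learn" region that the abstract proposes to audit for cue words.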
This paper introduces a novel ’universal’ approach to discourse annotation, serving as a comprehensive synthesis of the ISO 24617-8 semantic annotation framework and a newly developed multi-layer model of coherence relations. To address the complexities of text analysis, we present a hierarchical classification and a systematic decision tree. By unifying disparate formalisms, our model provides researchers with a robust, standardised methodology for analysing complex discourse structures across various linguistic contexts.
In this work, we propose a method for dialog simulation to gather high-quality open-domain, multi-turn question answering conversations. The simulation is grounded on Stack Exchange posts and motivated by computational discourse theory. We first convert forum posts into structured directed graphs; then, different traversals through the graph represent possible conversational trajectories. Our proposed graph traversal algorithm produces dialogs optimized for conversational efficiency. In addition, we propose an evaluation framework based on Gricean conversational maxims. Expert-level human annotators evaluate 105 cooking domain transcripts according to our framework; dialogs produced by our method receive ratings that are competitive with dialogs from prior work.
Errors in automatic coreference resolution can be traced back to errors in mention detection and coreference linking. In this paper, we analyse the errors in mention detection produced by the coreference resolver CorPipe (Straka 2023). In particular, we evaluate the performance on different variants of German (written, spoken, original, and simplified). We discuss the errors against the background of the fact that the tool was trained on a combination of different coreference corpora, including two German datasets with partially conflicting annotation guidelines. The results indicate that simplification has a significant effect on mention detection independent of the modality.
Evaluating pragmatic reasoning in large language models (LLMs) remains challenging because model behavior can vary depending on evaluation methods. Previous studies suggest that prompt-based judgments may diverge from models’ internal probability distributions, raising questions about whether observed performance reflects underlying competence or task-induced behavior. This study examines this issue using scalar diversity as a graded diagnostic for pragmatic inference. Following Hu & Levy (2023), this study compares direct probability measurement and metalinguistic prompting across multiple models and experimental settings. The results show that neither evaluation method consistently outperforms the other and that pragmatic behavior varies substantially across model families, prompting strategies, and task structures. Moreover, scalar diversity gradients emerge only in specific model–condition combinations, suggesting that pragmatic reasoning in LLMs reflects an interaction between internal probabilistic representations and task-induced prompting behavior rather than a stable competence captured by a single evaluation paradigm. These findings highlight the central role of evaluation design in interpreting pragmatic abilities in LLMs.
This paper describes the fifth edition of the Shared Task on Multilingual Coreference Resolution, held in conjunction with the CODI-CRAC 2026 workshop. Building on previous iterations, the task required participants to develop systems capable of mention identification and identity-based coreference clustering. The 2026 edition specifically emphasizes long-range entities, defined as coreferential chains spanning significant distances, across many words and sentences. The task expanded its linguistic scope by incorporating five new datasets and two additional languages. These additions leverage version 1.4 of CorefUD, a harmonized multilingual collection comprising 27 datasets in 19 languages. In total, ten systems participated, including four LLM-based approaches (three fine-tuned models and one few-shot approach). While traditional systems still maintained their lead, LLMs demonstrated significant potential, suggesting they may soon challenge established approaches in future editions.
Participating again in this year’s edition of the CRAC shared task on coreference resolution, we present our upgraded system with an official uplift of 15.46 percentage points in CoNLL-U score. We incorporated the larger Gemma 3 27B IT model, joint pre-training, headword tagging, more efficient training and inference as well as a sliding window to achieve this result. Our system placed second in the LLM track and third overall with a primary score of 73.83. We reached the highest scores on two datasets. Finally, we compare specialized and general LLM approaches.
We present our submission to the LLM track of the 2026 Computational Models of Reference, Anaphora and Coreference (CRAC 2026) shared task. With an average CoNLL F1 score of 74.32 on the official test set, our system ranked first in the LLM track, and third overall. Our system is based on the Gemma-3-27b model, fine-tuned using a two-stage strategy with a multilingual base adapter followed by dataset-specific adapters. We represent mention spans by their headword using an XML-inspired format with local reindexing and annotate documents iteratively. These design choices proved effective across languages, document lengths, and annotation guidelines.
This paper presents _Landcore_ (LANguage Dependent COreference REsolution), our submission to the LLM Track of the CRAC 2026 Shared Task on Multilingual Coreference Resolution. We explore the capabilities of LLMs in coreference resolution across multiple languages and domains, using a few-shot prompting approach. We design a comprehensive prompt that includes detailed instructions and examples and further enhance it using an LLM to produce language-specific prompts. We present an XML-inspired annotation scheme that is more suitable for LLMs than the provided formats. Although our solution is not the best-performing, we show that our ideas improve performance across various settings.
We describe PortNLP’s submission to the CRAC 2026 Shared Task on Multilingual Coreference Resolution (LLM track). Our system fine-tunes Qwen 3 14B with QLoRA on CorefUD 1.4 gold annotations across 27 corpora spanning 19 languages. Documents are processed in 500-700 character chunks with a bounded rolling context consisting of 500 characters of recent annotated text and a scored entity registry that tracks up to 30 active entities via a frequency-times-recency decay formula. We employ data augmentation and language-aware sampling strategies to handle typological and data-size diversity. Our system achieves 68.69 CoNLL F1 averaged across all 27 test corpora. We additionally present probing experiments on the LoRA adapter’s internal representations, finding that coreference signal is concentrated in attention value projections rather than MLP modules, with the strongest readout at the earliest transformer layer.
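The abstract above does not give the scoring formula for the entity registry, so the following is only a sketch of one plausible frequency-times-recency scheme: each entity's mention count is multiplied by an exponential decay over the number of chunks since it was last seen. The class name, the method names, the decay value of 0.8, and the exponential form are all assumptions; only the cap of 30 active entities comes from the abstract.

```python
class EntityRegistry:
    """Hypothetical registry keeping the highest-scoring entities in context."""

    def __init__(self, capacity=30, decay=0.8):
        self.capacity = capacity  # the abstract mentions up to 30 active entities
        self.decay = decay        # assumed per-chunk decay factor
        self.entries = {}         # entity_id -> (mention_count, last_seen_step)

    def observe(self, entity_id, step):
        """Record one mention of entity_id in the chunk numbered `step`."""
        count, _ = self.entries.get(entity_id, (0, step))
        self.entries[entity_id] = (count + 1, step)

    def score(self, entity_id, step):
        """Frequency times recency: count * decay ** (chunks since last seen)."""
        count, last_seen = self.entries[entity_id]
        return count * self.decay ** (step - last_seen)

    def active(self, step):
        """Return the top `capacity` entity ids, best score first."""
        ranked = sorted(
            self.entries, key=lambda e: self.score(e, step), reverse=True
        )
        return ranked[: self.capacity]


if __name__ == "__main__":
    registry = EntityRegistry(capacity=2, decay=0.5)
    for _ in range(3):
        registry.observe("e1", step=0)  # frequent, but mentioned long ago
    registry.observe("e2", step=4)
    registry.observe("e3", step=5)
    print(registry.active(step=5))
```

Under this scheme a frequently mentioned entity can still drop out of the context if it has not been seen for several chunks, which is one way to keep a rolling context bounded.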
This paper describes our multilingual coreference system developed for the CRAC 2026 unconstrained track. We introduce a unified, single-model architecture based on Conditional Random Fields (CRFs) that supports 20 languages. Notably, our approach achieves multilingual resolution without the use of large language models (LLMs) or pretrained weights. In contrast to resource-intensive neural methods, the proposed model is efficient, and suitable for deployment on standard hardware (CPUs). It uses linguistic and contextual features to capture coreference relations across languages with diverse syntactic and morphological properties. Model training was conducted using the official data distributions released for the CRAC 2026 shared task. This methodology provides a robust, scalable solution for multilingual NLP, demonstrating high utility within resource-constrained environments. The results highlight that feature-driven structured models remain effective for complex cross-lingual tasks. The performance on test data is similar to the results obtained for the development data.
We introduce CorPipe 26, our winning submission to the CRAC 2026 Shared Task on Multilingual Coreference Resolution. The fifth edition of this shared task focuses mainly on the comparison of generative LLMs and specialized systems; additionally, 5 more datasets and 2 new languages are introduced. CorPipe 26 is an improved version of CorPipe 25, with a new variant predicting empty nodes together with mentions and coreference links in a single model. Our system outperforms all other submissions in the LLM track by 2.8 percentage points and all submissions in the unconstrained track by 9.5 percentage points. Furthermore, we perform a series of ablation experiments with different model sizes, empty node prediction methods, and cross-lingual zero-shot evaluation. The source code and the trained models are publicly available at https://github.com/ufal/crac2026-corpipe.
We present DAggerCoref, our submission to the CRAC 2026 Shared Task on Multilingual Coreference Resolution. DAggerCoref is a three-stage cascade built on XLM-RoBERTa-large: a gap classifier for zero pronoun detection, a mention head classifier, and a coarse-to-fine antecedent scorer. Our central contribution is applying DAgger (Ross et al., 2011) to coreference resolution: after training the antecedent scorer on gold mentions, we fine-tune on a 50/50 mix of gold and pipeline-predicted mentions, closing the train/test distribution mismatch and improving development set macro CoNLL F1 by 1.10 points. We also introduce Otsu adaptive thresholding for zero pronoun detection, which matches gold-tuned per-dataset thresholds without requiring any gold supervision. Our system achieves a macro CoNLL F1 of 67.56 on the official test set across 27 datasets and 19 languages.
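The abstract does not say how Otsu thresholding is applied to the gap classifier, so the following is a minimal sketch under one assumption: the classifier outputs one score per candidate gap, and a per-dataset threshold is chosen by maximizing Otsu's between-class variance over those scores. This version evaluates the criterion on the sorted scores directly rather than on a histogram; the function name and the toy scores are hypothetical.

```python
def otsu_threshold(scores):
    """Pick the cut that maximizes Otsu's between-class variance.

    For a split into a low group (weight w0, mean mu0) and a high group
    (weight w1, mean mu1), the criterion is w0 * w1 * (mu0 - mu1) ** 2.
    No gold labels are used, only the distribution of the scores.
    """
    values = sorted(scores)
    n = len(values)
    if n < 2:
        raise ValueError("need at least two scores")

    total = sum(values)
    best_threshold, best_variance = values[0], -1.0
    left_sum = 0.0
    for i in range(1, n):
        left_sum += values[i - 1]
        if values[i] == values[i - 1]:
            continue  # no cut is possible between two equal scores
        w0, w1 = i / n, (n - i) / n
        mu0 = left_sum / i
        mu1 = (total - left_sum) / (n - i)
        variance = w0 * w1 * (mu0 - mu1) ** 2
        if variance > best_variance:
            best_variance = variance
            # Place the threshold midway between the two neighbouring scores.
            best_threshold = (values[i - 1] + values[i]) / 2
    return best_threshold


if __name__ == "__main__":
    # Hypothetical gap-classifier scores forming two clusters.
    gap_scores = [0.10, 0.20, 0.15, 0.90, 0.85, 0.95]
    threshold = otsu_threshold(gap_scores)
    predicted_zero_pronouns = [s for s in gap_scores if s > threshold]
    print(threshold, predicted_zero_pronouns)
```

Because the threshold depends only on the score distribution, it can be recomputed for each dataset without any annotated development data, which is the property the abstract highlights.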