uppdf
bib
Proceedings of the 16th International Conference on Computational Semantics
Kilian Evang
|
Laura Kallmeyer
|
Sylvain Pogodalla
pdf
bib
abs
Advancing the Database of Cross-Linguistic Colexifications with New Workflows and Data
Annika Tjuka
|
Robert Forkel
|
Christoph Rzymski
|
Johann-Mattis List
Lexical resources are crucial for cross-linguistic analysis and can provide new insights into computational models for natural language learning. Here, we present an advanced database for comparative studies of words with multiple meanings, a phenomenon known as colexification. The new version includes improvements in the handling, selection and presentation of the data. We compare the new database with previous versions and find that our improvements provide a more balanced sample covering more language families worldwide, with enhanced data quality, given that all word forms are provided in phonetic transcription. We conclude that the new Database of Cross-Linguistic Colexifications has the potential to inspire exciting new studies that link cross-linguistic data to open questions in linguistic typology, historical linguistics, psycholinguistics, and computational linguistics.
pdf
bib
abs
FRIDA to the Rescue! Analyzing Synthetic Data Effectiveness in Object-Based Common Sense Reasoning for Disaster Response
Mollie Shichman
|
Claire Bonial
|
Austin Blodgett
|
Taylor Pellegrin
|
Francis Ferraro
|
Rachel Rudinger
During Human Robot Interactions in disaster relief scenarios, Large Language Models (LLMs) have the potential for substantial physical reasoning to assist in mission objectives. However, these reasoning capabilities are often found only in larger models, which are not currently reasonable to deploy on robotic systems due to size constraints. To meet our problem space requirements, we introduce a dataset and pipeline to create Field Reasoning and Instruction Decoding Agent (FRIDA) models. In our pipeline, domain experts and linguists combine their knowledge to make high-quality, few-shot prompts used to generate synthetic data for fine-tuning. We hand-curate datasets for this few-shot prompting and for evaluation to improve LLM reasoning on both general and disaster-specific objects. We concurrently run an ablation study to understand which kinds of synthetic data most affect performance. We fine-tune several small instruction-tuned models and find that ablated FRIDA models only trained on objects’ physical state and function data outperformed both the FRIDA models trained on all synthetic data and the base models in our evaluation. We demonstrate that the FRIDA pipeline is capable of instilling physical common sense with minimal data.
pdf
bib
abs
SemToken: Semantic-Aware Tokenization for Efficient Long-Context Language Modeling
Dong Liu
|
Yanxuan Yu
Tokenization plays a critical role in language modeling, yet existing approaches such as Byte-Pair Encoding (BPE) or WordPiece operate purely on frequency statistics, ignoring the underlying semantic structure of text. This leads to over-tokenization of semantically redundant spans and underutilization of contextual coherence, particularly in long-context scenarios. In this work, we propose SemToken, a semantic-aware tokenization framework that jointly reduces token redundancy and improves computation efficiency. SemToken first extracts contextual semantic embeddings via lightweight encoders and performs local semantic clustering to merge semantically equivalent tokens. Then, it allocates heterogeneous token granularity based on semantic density, allowing finer-grained tokenization in content-rich regions and coarser compression in repetitive or low-entropy spans. SemToken can be seamlessly integrated with modern language models and attention acceleration methods. Experiments on long-context language modeling benchmarks such as WikiText-103 and LongBench show that SemToken achieves up to 2.4× reduction in token count and 1.9× speedup, with negligible or no degradation in perplexity and downstream accuracy. Our findings suggest that semantic structure offers a promising new axis for optimizing tokenization and computation in large language models.
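To make the clustering step concrete, here is a toy sketch of merging adjacent, semantically near-identical tokens; the random vectors, similarity threshold, and greedy left-to-right policy are illustrative assumptions rather than details of SemToken itself.

```python
import numpy as np

rng = np.random.default_rng(1)
tokens = ["the", "the", "big", "big", "storm"]
vecs = [rng.normal(size=16) for _ in tokens]
vecs[1], vecs[3] = vecs[0], vecs[2]  # make the repeated tokens semantically identical

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Greedily collapse a token into the preceding unit when their vectors are
# (near-)identical, i.e. the span is semantically redundant.
merged = [(tokens[0], vecs[0])]
for tok, v in zip(tokens[1:], vecs[1:]):
    prev_tok, prev_v = merged[-1]
    if cos(prev_v, v) > 0.9:
        merged[-1] = (prev_tok, (prev_v + v) / 2)  # fuse into one coarser unit
    else:
        merged.append((tok, v))

print([t for t, _ in merged])  # -> ['the', 'big', 'storm'] with these toy vectors
```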
pdf
bib
abs
ding-01 :ARG0: An AMR Corpus for Spontaneous French Dialogue
Jeongwoo Kang
|
Maria Boritchev
|
Maximin Coavoux
We present our work to build a French semantic corpus by annotating French dialogue in Abstract Meaning Representation (AMR). Specifically, we annotate the DinG corpus, consisting of transcripts of spontaneous French dialogues recorded during the board game Catan. As AMR has insufficient coverage of the dynamics of spontaneous speech, we extend the framework to better represent spontaneous speech and sentence structures specific to French. Additionally, to support consistent annotation, we provide an annotation guideline detailing these extensions. We publish our corpus under a free license (CC-SA-BY). We also train and evaluate an AMR parser on our data. This model can be used as an assistance annotation tool to provide initial annotations that can be refined by human annotators. Our work contributes to the development of semantic resources for French dialogue.
pdf
bib
abs
A Graph Autoencoder Approach for Gesture Classification with Gesture AMR
Huma Jamil
|
Ibrahim Khebour
|
Kenneth Lai
|
James Pustejovsky
|
Nikhil Krishnaswamy
We present a novel graph autoencoder (GAE) architecture for classifying gestures using Gesture Abstract Meaning Representation (GAMR), a structured semantic annotation framework for gestures in collaborative tasks. We leverage the inherent graphical structure of GAMR by employing Graph Neural Networks (GNNs), specifically an Edge-aware Graph Attention Network (EdgeGAT), to learn embeddings of gesture semantic representations. Using the EGGNOG dataset, which captures diverse physical gesture forms expressing similar semantics, we evaluate our GAE on a multi-label classification task for gestural actions. Results indicate that our approach significantly outperforms naive baselines and is competitive with specialized Transformer-based models like AMRBART, despite using considerably fewer parameters and no pretraining. This work highlights the effectiveness of structured graphical representations in modeling multimodal semantics, offering a scalable and efficient approach to gesture interpretation in situated human-agent collaborative scenarios.
pdf
bib
abs
Retrieval-Augmented Semantic Parsing: Improving Generalization with Lexical Knowledge
Xiao Zhang
|
Qianru Meng
|
Johan Bos
Open-domain semantic parsing remains a challenging task, as neural models often rely on heuristics and struggle to handle unseen concepts. In this paper, we investigate the potential of large language models (LLMs) for this task and introduce Retrieval-Augmented Semantic Parsing (RASP), a simple yet effective approach that integrates external symbolic knowledge into the parsing process. Our experiments not only show that LLMs outperform previous encoder-decoder baselines for semantic parsing, but that RASP further enhances their ability to predict unseen concepts, nearly doubling the performance of previous models on out-of-distribution concepts. These findings highlight the promise of leveraging large language models and retrieval mechanisms for robust and open-domain semantic parsing.
pdf
bib
abs
Not Just Who or What: Modeling the Interaction of Linguistic and Annotator Variation in Hateful Word Interpretation
Sanne Hoeken
|
Özge Alacam
|
Dong Nguyen
|
Massimo Poesio
|
Sina Zarrieß
Interpreting whether a word is hateful in context is inherently subjective. While growing research in NLP recognizes the importance of annotation variation and moves beyond treating it as noise, most work focuses primarily on annotator-related factors, often overlooking the role of linguistic context and its interaction with individual interpretation. In this paper, we investigate the factors driving variation in hateful word meaning interpretation by extending the HateWiC dataset with linguistic and annotator-level features. Our empirical analysis shows that variation in annotations is not solely a function of who is interpreting or what is being interpreted, but of the interaction between the two. We evaluate how well models replicate the patterns of human variation. We find that incorporating annotator information can improve alignment with human disagreement but still underestimates it. Our findings further demonstrate that capturing interpretation variation requires modeling the interplay between annotators and linguistic content and that neither surface-level agreement nor predictive accuracy alone is sufficient for truly reflecting human variation.
pdf
bib
abs
Context Effects on the Interpretation of Complement Coercion: A Comparative Study with Language Models in Norwegian
Matteo Radaelli
|
Emmanuele Chersoni
|
Alessandro Lenci
|
Giosuè Baggio
In complement coercion sentences, like *John began the book*, a covert event (e.g., reading) may be recovered based on lexical meanings, world knowledge, and context. We investigate how context influences coercion interpretation performance for 17 language models (LMs) in Norwegian, a low-resource language. Our new dataset contains isolated coercion sentences (context-neutral), the same sentences with a subject NP that suggests a particular covert event, and sentences with a similar effect that precede or follow the coercion sentence. LMs generally benefit from contextual enrichment, but performance varies depending on the model. Models that struggled on context-neutral sentences showed greater improvements from contextual enrichment. Subject NPs and pre-coercion sentences had the largest effect in facilitating coercion interpretation.
pdf
bib
abs
LLMs Struggle with NLI for Perfect Aspect: A Cross-Linguistic Study in Chinese and Japanese
Lu Jie
|
Du Jin
|
Hitomi Yanaka
Unlike English, which uses distinct forms (e.g., had, has, will have) to mark the perfect aspect across tenses, Chinese and Japanese lack separate grammatical forms for tense within the perfect aspect, which complicates Natural Language Inference (NLI). Focusing on the perfect aspect in these languages, we construct a linguistically motivated, template-based NLI dataset (1,350 pairs per language). Experiments reveal that even advanced LLMs struggle with temporal inference, particularly in detecting subtle tense and reference-time shifts. These findings highlight model limitations and underscore the need for cross-linguistic evaluation in temporal semantics. Our dataset is available at https://github.com/Lujie2001/CrossNLI.
pdf
bib
abs
Assessing LLMs’ Understanding of Structural Contrasts in the Lexicon
Shuxu Li
|
Antoine Venant
|
Philippe Langlais
|
François Lareau
We present a new benchmark to evaluate the lexical competence of large language models (LLMs), built on a hierarchical classification of lexical functions (LFs) within the Meaning-Text Theory (MTT) framework. Based on a dataset called French Lexical Network (LN-fr), the benchmark employs contrastive tasks to probe the models’ sensitivity to fine-grained paradigmatic and syntagmatic distinctions. Our results show that performance varies significantly across different LFs and systematically declines with increased distinction granularity, highlighting current LLMs’ limitations in relational and structured lexical understanding.
pdf
bib
abs
A German WSC dataset comparing coreference resolution by humans and machines
Wiebke Petersen
|
Katharina Spalek
We present a novel German Winograd-style dataset for direct comparison of human and model behavior in coreference resolution. Ten participants per item provided accuracy, confidence ratings, and response times. Unlike classic WSC tasks, humans select among three pronouns rather than between two potential antecedents, increasing task difficulty. While majority vote accuracy is high, individual responses reveal that not all items are trivial and that variability is obscured by aggregation. Pretrained language models evaluated without fine-tuning show clear performance gaps, yet their accuracy and confidence scores correlate notably with human data, mirroring certain patterns of human uncertainty and error. Dataset-specific limitations, including pragmatic reinterpretations and imbalanced pronoun distributions, highlight the importance of high-quality, balanced resources for advancing computational and cognitive models of coreference resolution.
pdf
bib
abs
Finding Answers to Questions: Bridging between Type-based and Computational Neuroscience Approaches
Staffan Larsson
|
Jonathan Ginzburg
|
Robin Cooper
|
Andy Lücking
The paper outlines an account of how the brain might process questions and answers in linguistic interaction, focusing on accessing answers in memory and combining questions and answers into propositions. To enable this, we provide an approximation of the lambda calculus implemented in the Semantic Pointer Architecture (SPA), a neural implementation of a Vector Symbolic Architecture. The account builds a bridge between the type-based accounts of propositions in memory (as in the treatments of belief by Ranta (1994) and Cooper (2023)) and the suggestion for question answering made by Eliasmith (2013), where question answering is described in terms of transformations of structured representations in memory that provide an answer. We will take such representations to correspond to beliefs of the agent. On Cooper's analysis, beliefs are considered to be types which have a record structure closely related to the structure which Eliasmith codes in vector representations (Larsson et al., 2023). Thus the act of answering a question can be seen to have a neural base in a vector transformation translatable in Eliasmith's system to activity of spiking neurons, and to correspond to using an item in memory (a belief) to provide an answer to the question.
pdf
bib
abs
Can Large Language Models Robustly Perform Natural Language Inference for Japanese Comparatives?
Yosuke Mikami
|
Daiki Matsuoka
|
Hitomi Yanaka
Large Language Models (LLMs) perform remarkably well in Natural Language Inference (NLI). However, NLI involving numerical and logical expressions remains challenging. Comparatives are a key linguistic phenomenon related to such inference, but the robustness of LLMs in handling them, especially in languages that are not dominant in the models’ training data, such as Japanese, has not been sufficiently explored. To address this gap, we construct a Japanese NLI dataset that focuses on comparatives and evaluate various LLMs in zero-shot and few-shot settings. Our results show that the performance of the models is sensitive to the prompt formats in the zero-shot setting and influenced by the gold labels in the few-shot examples. The LLMs also struggle to handle linguistic phenomena unique to Japanese. Furthermore, we observe that prompts containing logical semantic representations help the models predict the correct labels for inference problems that they struggle to solve even with few-shot examples.
pdf
bib
abs
Is neural semantic parsing good at ellipsis resolution, or isn’t it?
Xiao Zhang
|
Johan Bos
Neural semantic parsers have shown good overall performance for a variety of linguistic phenomena, reaching semantic matching scores of more than 90%. But how do such parsers perform on strongly context-sensitive phenomena, where large pieces of semantic information need to be duplicated to form a meaningful semantic representation? A case in point is English verb phrase ellipsis, a construct where entire verb phrases can be abbreviated by a single auxiliary verb. Are these otherwise powerful semantic parsers able to deal with ellipsis, or aren’t they? We constructed a corpus of 120 cases of ellipsis with their fully resolved meaning representations and used it as a challenge set for a large battery of neural semantic parsers. Although these parsers performed very well on the standard test set, they failed on the instances with ellipsis. Data augmentation helped improve the parsing results. The reason for the difficulty of parsing elided phrases is not that copying semantic material is hard, but that such phrases usually occur in linguistically complicated contexts, which causes most of the parsing errors.
pdf
bib
abs
Extracting Behaviors from German Clinical Interviews in Support of Autism Spectrum Diagnosis
Margareta A. Kulcsar
|
Ian Paul Grant
|
Massimo Poesio
Accurate identification of behaviors is essential for diagnosing developmental disorders such as Autism Spectrum Disorder (ASD). We frame the extraction of behaviors from text as a specialized form of event extraction grounded in the TimeML framework and evaluate two approaches: a pipeline model and an end-to-end model that directly extracts behavior spans from raw text. We introduce two novel datasets: a new clinical annotation of an existing Reddit corpus of parent-authored posts in English and a clinically annotated corpus of German ASD diagnostic interviews. On the English dataset, the end-to-end BERT model achieved an F1 score of 73.4% in behavior classification, outperforming the pipeline models (F1: 66.8% and 53.65%). On the German clinical dataset, the end-to-end model reached an even higher F1 score of 80.1%, again outperforming the pipeline (F1: 78.7%) and approaching the gold-annotated upper bound (F1: 92.9%). These results demonstrate that behavior classification benefits from direct extraction, and that our method generalizes across domains and languages.
pdf
bib
abs
The Proper Treatment of Verbal Idioms in German Discourse Representation Structure Parsing
Kilian Evang
|
Rafael Ehren
|
Laura Kallmeyer
Existing datasets for semantic parsing lack adequate representations of potentially idiomatic expressions (PIEs), i.e., expressions consisting of two or more lexemes that can occur with either a literal or an idiomatic reading. As a result, we cannot test semantic parsers for their ability to correctly distinguish between the two cases, and to assign appropriate meaning representations. We address this situation by combining two semantically annotated resources to obtain a corpus of German sentences containing literal and idiomatic occurrences of PIEs, paired with meaning representations whose concepts and roles reflect the respective literal or idiomatic meaning. Experiments with a state-of-the-art semantic parser show that given appropriate training data, it can learn to predict the idiomatic meanings and improve performance also for literal readings, even though predicting the correct concepts in context remains challenging. We provide additional insights through evaluation on synthetic data.
pdf
bib
abs
Does discourse structure help action prediction? A look at Correction Triangles.
Kate Thompson
|
Akshay Chaturvedi
|
Nicholas Asher
An understanding of natural language corrections is essential for artificial agents that are meant to collaborate and converse with humans. We present some preliminary experiments investigating whether discourse structure, in particular Correction relations, improves the action prediction capabilities of language-to-action models for simple block world tasks. We focus on scenarios in which a model must correct a previous action, and present a corpus of synthetic dialogues to help explain model performance.
pdf
bib
abs
FAMWA: A new taxonomy for classifying word associations (which humans improve at but LLMs still struggle with)
Maria A. Rodriguez
|
Marie Candito
|
Richard Huyghe
Word associations have a longstanding tradition of being instrumental for investigating the organization of the mental lexicon. Despite their wide application in psychology and psycholinguistics, analyzing word associations remains challenging due to their inherent heterogeneity and variability, shaped by linguistic and extralinguistic factors. Existing word-association taxonomies often suffer from limitations due to a lack of comprehensive frameworks that capture their complexity. To address these limitations, we introduce a linguistically motivated taxonomy consisting of co-existing meaning-related and form-related relations, while accounting for the directionality of word associations. We applied the taxonomy to a dataset of 1,300 word associations (FAMWA) and assessed it using various LLMs, analyzing their ability to classify word associations. The results show an improved inter-annotator agreement for our taxonomy compared to previous studies (𝜅 = .60 for meaning and 𝜅 = .58 for form). However, models such as GPT-4o perform only modestly in relation labeling (with accuracies of 46.2% for meaning and 78.3% for form), which calls into question their ability to fully grasp the underlying principles of human word associations.
pdf
bib
abs
Computational Semantics Tools for Glue Semantics
Mark-Matthias Zymla
|
Mary Dalrymple
|
Agnieszka Patejuk
This paper introduces a suite of computational semantic tools for Glue Semantics, an approach to compositionality developed in the context of Lexical Functional Grammar (LFG), but applicable to a variety of syntactic representations, including Universal Dependencies (UD). The three tools are: 1) a Glue Semantics prover, 2) an interface between this prover and a platform for implementing LFG grammars, and 3) a system to rewrite and add semantic annotations to LFG and UD syntactic analyses, with native support for the prover. The main use of these tools is computational verification of theoretical linguistic analyses, but they have also been used for teaching formal semantic concepts.
pdf
bib
abs
Which Model Mimics Human Mental Lexicon Better? A Comparative Study of Word Embedding and Generative Models
Huacheng Song
|
Zhaoxin Feng
|
Emmanuele Chersoni
|
Chu-Ren Huang
Word associations are commonly applied in psycholinguistics to investigate the nature and structure of the human mental lexicon, and they are at the same time an important data source for measuring the alignment of language models with human semantic representations. Taking this view, we compare the capacities of different language models to model collective human association norms via five word association tasks (WATs), with predictions about associations driven either by word vector similarities for traditional embedding models or by prompting large language models (LLMs). Our results demonstrate that neither approach could produce human-like performance in all five WATs. Hence, none of them can successfully model the human mental lexicon yet. Our detailed analysis shows that static word-type embeddings and prompted LLMs have overall better alignment with human norms compared to word-token embeddings from pretrained models like BERT. Further analysis suggests that the performance discrepancies may be due to different model architectures, especially in terms of approximating human-like associative reasoning through either semantic similarity or relatedness evaluation. Our code and data are publicly available at: https://github.com/florethsong/word_association.
pdf
bib
abs
Semantic Analysis Experiments for French Citizens’ Contribution : Combinations of Language Models and Community Detection Algorithms
Sami Guembour
|
Dominguès
|
Sabine Ploux
Following the Yellow Vest crisis that occurred in France in 2018, the French government launched the Grand Débat National, which gathered citizens’ contributions. This paper presents a semantic analysis of these contributions by segmenting them into sentences and identifying the topics addressed using clustering techniques. The study tests several combinations of French language models and community detection algorithms, aiming to identify the most effective pairing for grouping sentences based on thematic similarity. Performance is evaluated using the number of clusters generated and standard clustering metrics. Principal Component Analysis (PCA) is employed to assess the impact of dimensionality reduction on sentence embeddings and clustering quality. Cluster merging methods are also developed to reduce redundancy and improve the relevance of the identified topics. Finally, the results help refine semantic analysis and shed light on the main concerns expressed by citizens.
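A minimal sketch of one such model/algorithm pairing, combining a multilingual sentence encoder with graph-based community detection; the encoder choice, nearest-neighbour graph construction, and example sentences are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from sentence_transformers import SentenceTransformer

sentences = [
    "Il faut baisser les taxes sur les carburants.",      # fuel taxes should be lowered
    "Les impôts sur l'essence sont trop élevés.",         # taxes on petrol are too high
    "Je souhaite davantage de trains en zone rurale.",    # more trains in rural areas
    "Les transports en commun manquent à la campagne.",   # public transport is lacking
]
vecs = SentenceTransformer("distiluse-base-multilingual-cased-v2").encode(
    sentences, normalize_embeddings=True)

# Build a sentence graph: link each sentence to its most similar neighbour,
# weighting the edge by cosine similarity (vectors are already normalized).
sims = vecs @ vecs.T
G = nx.Graph()
G.add_nodes_from(range(len(sentences)))
for i in range(len(sentences)):
    j = int(np.argsort(sims[i])[-2])  # most similar sentence other than itself
    G.add_edge(i, j, weight=float(sims[i, j]))

# Community detection over the similarity graph yields candidate topics.
for community in greedy_modularity_communities(G, weight="weight"):
    print([sentences[k] for k in sorted(community)])
```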
pdf
bib
abs
Neurosymbolic AI for Natural Language Inference in French : combining LLMs and theorem provers for semantic parsing and natural language reasoning
Maximos Skandalis
|
Lasha Abzianidze
|
Richard Moot
|
Christian Retoré
|
Simon Robillard
In this article, we describe the first comprehensive neurosymbolic pipeline for the task of Natural Language Inference (NLI) for French, built on the synergy of Large Language Models (CamemBERT) and automated theorem provers (GrailLight, LangPro). LLMs prepare the input for GrailLight by tagging each token with Part-of-Speech and grammatical information based on the Type-Logical Grammar formalism. GrailLight then produces the lambda-terms given as input to the LangPro theorem prover, a tableau-based theorem prover for natural logic originally developed for English. Currently, the proposed system works on the French version of the SICK dataset. The results obtained are comparable to those on the English and Dutch versions of SICK with the same LangPro theorem prover, and are better than the results of recent transformers on this specific dataset. Finally, we have identified ways to further improve the results obtained, such as giving the theorem prover access to lexical knowledge via a knowledge base for French.
pdf
bib
abs
ProPara-CRTS: Canonical Referent Tracking for Reliable Evaluation of Entity State Tracking in Process Narratives
Bingyang Ye
|
Timothy Obiso
|
Jingxuan Tu
|
James Pustejovsky
Despite the abundance of datasets for procedural texts such as cooking recipes, resources that capture full process narratives, paragraph-long descriptions that follow how multiple entities evolve across a sequence of steps, remain scarce. Although synthetic resources offer useful toy settings, they fail to capture the linguistic variability of naturally occurring prose. ProPara remains the only sizeable, naturally occurring corpus of process narratives, yet ambiguities and inconsistencies in its schema and annotations hinder reliable evaluation of its core task, Entity State Tracking (EST). In this paper, we introduce a Canonical Referent Tracking Schema (CRTS) that assigns every surface mention to a unique, immutable discourse referent and records that referent’s existence and location at each step. Applying CRTS to ProPara, we release the re-annotated result as ProPara-CRTS. The new corpus resolves ambiguous participant mentions in ProPara and consistently boosts performance across a variety of models. This suggests that principled schema design and targeted re-annotation can unlock measurable improvements in EST, providing a sharper diagnostic of model capacity in process narrative understanding without any changes to model architecture.
pdf
bib
abs
The Difficult Case of Intended and Perceived Sarcasm: a Challenge for Humans and Large Language Models
Hyewon Jang
|
Diego Frassinelli
We examine the cases of failed communication in sarcasm, defined as ‘the discrepancy between what speakers and observers perceive as sarcasm’. We identify factors that are associated with such failures, and how those difficult instances affect the detection performance of encoder-only and decoder-only generative models. We find that speakers’ incongruity between their felt annoyance and sarcasm in their utterance is highly correlated with sarcasm that fails to be communicated to human observers. This factor also relates to the drop of classification performance of large language models (LLMs). Additionally, disagreement among multiple observers about sarcasm is correlated with poorer performance of LLMs. Finally, we find that generative models produce better results with ground-truth labels from speakers than from observers, in contrast to encoder-only models, which suggests a general tendency by generative models to identify with speakers’ perspective by default.
pdf
bib
abs
A Model of Information State in Situated Multimodal Dialogue
Kenneth Lai
|
Lucia Donatelli
|
Richard Brutti
|
James Pustejovsky
In a successful dialogue, participants come to a mutual understanding of the content being communicated through a process called conversational grounding. This can occur through language, and also via other communicative modalities like gesture. Other kinds of actions also give information as to what has been understood from the dialogue. Moreover, achieving common ground not only involves establishing agreement on a set of facts about discourse referents, but also agreeing on what those entities refer to in the outside world, i.e., situated grounding. We use examples from a corpus of multimodal interaction in a task-based setting, annotated with Abstract Meaning Representation (AMR), to explore how speech, gesture, and action contribute to the construction of common ground. Using a simple model of information state, we discuss ways in which existing annotation schemes facilitate this analysis, as well as information that current annotations do not yet capture. Our research sheds light on the interplay between language, gesture, and action in multimodal communication.
pdf
bib
abs
Learning to Refer: How Scene Complexity Affects Emergent Communication in Neural Agents
Dominik Künkele
|
Simon Dobnik
We explore how neural network-based agents learn to map continuous sensory input to discrete linguistic symbols through interactive language games. One agent describes objects in 3D scenes using invented vocabulary; the other interprets references based on attributes like shape, color, and size. Learning is guided by feedback from successful interactions. We extend the CLEVR dataset with more complex scenes to study how increased referential complexity impacts language acquisition and symbol grounding in artificial agents.
pdf
bib
abs
On the Role of Linguistic Features in LLM Performance on Theory of Mind Tasks
Ekaterina Kozachenko
|
Gonçalo Guiomar
|
Karolina Stanczak
Theory of Mind presents a fundamental challenge for Large Language Models (LLMs), revealing gaps in processing intensional contexts where beliefs diverge from reality. We analyze six LLMs across 2,860 annotated stories, measuring factors such as idea density, mental state verb distribution, and perspectival complexity markers. Notably, and in contrast to humans, we find that LLMs show positive correlations with linguistic complexity. In fact, they achieve high accuracy (74-95%) on high complexity stories with explicit mental state scaffolding, yet struggle with low complexity tasks requiring implicit reasoning (51-77%). Furthermore, we find that linguistic markers systematically influence performance, with contrast markers decreasing accuracy by 5-9% and knowledge verbs increasing it by 4-10%. This inverse relationship between linguistic complexity and performance, contrary to human cognition, may suggest that current LLMs rely on surface-level linguistic cues rather than genuine mental state reasoning.
pdf
bib
abs
Mapping Semantic Domains Across India’s Social Media: Networks, Geography, and Social Factors
Gunjan Anand
|
Jonathan Dunn
This study examines socially-conditioned variation within semantic domains like kinship and weather, using thirteen Indian cities as a case study. Using bilingual social media data, we infer six semantic domains from corpora representing individual cities, with a lexicon including terms from English, Hindi and Transliterated Hindi. The process of inferring semantic domains uses character-based embeddings to retrieve nearest neighbors and Jaccard similarity to operationalize the edge weights between lexical items within each domain. These representations reveal distinct regional variation across all six domains. We then examine the relationship between variation in semantic domains and external social factors such as literacy rates and local demographics. The results show that semantic domains exhibit systematic influences from sociolinguistic factors, a finding that has significant implications for the idea that semantic domains can be studied as abstractions distinct from specific speech communities.
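A small sketch of the edge-weighting idea described above, computing Jaccard similarity between nearest-neighbour sets of lexical items; the neighbour sets are invented for illustration rather than taken from the paper's character-based embeddings.

```python
# Jaccard similarity between the nearest-neighbour sets of two lexical items,
# used as the weight of the edge connecting them within a semantic domain.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

# Toy nearest-neighbour sets (hypothetical; real sets come from embeddings).
neighbors = {
    "barish": {"rain", "cloud", "monsoon", "storm"},
    "rain":   {"barish", "cloud", "monsoon", "thunder"},
    "chacha": {"uncle", "mama", "tau", "bhai"},
}

print(f"weather edge (barish-rain): {jaccard(neighbors['barish'], neighbors['rain']):.2f}")
print(f"cross-domain edge (barish-chacha): {jaccard(neighbors['barish'], neighbors['chacha']):.2f}")
```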
pdf
bib
abs
Disentangling lexical and grammatical information in word embeddings
Li Liu
|
François Lareau
To enable finer-grained linguistic analysis, we propose a method for the separation of lexical and grammatical information within contextualized word embeddings. Using CamemBERT embeddings for French, we apply our method to 14,472 inflected word forms extracted from the Lexical Network of French (LN-fr), covering 1,468 nouns, 202 adjectives and 299 verbs inflected via 14 distinct grammatical feature values. Our iterative distillation alternates two steps until convergence: (i) estimating lexical or grammatical vectors by averaging the embeddings of words that share the same lexeme or grammatical feature value, and (ii) isolating the complementary component of each word embedding by subtracting the estimated vector. To assess the quality of the decomposition, we measure whether the resulting lexical and grammatical vectors form more compact clusters within their respective groups and whether their sum better reconstructs the original word embeddings. All evaluations rely on L2 distance. The observed improvements in both clustering and reconstruction accuracy demonstrate the effectiveness of our approach.
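A minimal sketch of the alternating estimate-and-subtract procedure described above, using toy vectors in place of real CamemBERT embeddings; the grouping keys, dimensionality, and fixed iteration count are illustrative assumptions.

```python
import numpy as np

# Toy tokens: (lexeme, grammatical feature value, embedding vector).
rng = np.random.default_rng(0)
tokens = [("chanter", "pres.3sg", rng.normal(size=8)),
          ("chanter", "past.part", rng.normal(size=8)),
          ("danser", "pres.3sg", rng.normal(size=8)),
          ("danser", "past.part", rng.normal(size=8))]

emb = np.stack([v for _, _, v in tokens])
lex_keys = [l for l, _, _ in tokens]
gram_keys = [g for _, g, _ in tokens]

def group_means(keys, vectors):
    """Average the vectors of all tokens sharing the same key (lexeme or feature)."""
    return {k: vectors[[i for i, key in enumerate(keys) if key == k]].mean(axis=0)
            for k in set(keys)}

gram_part = np.zeros_like(emb)
for _ in range(20):  # alternate the two steps for a fixed number of iterations
    # (i) estimate lexical vectors from embeddings minus the grammatical part
    lex_part = np.stack([group_means(lex_keys, emb - gram_part)[k] for k in lex_keys])
    # (ii) estimate grammatical vectors from the complementary component
    gram_part = np.stack([group_means(gram_keys, emb - lex_part)[k] for k in gram_keys])

# Reconstruction quality in L2 distance, mirroring the evaluation criterion above.
print(np.linalg.norm(emb - (lex_part + gram_part), axis=1))
```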
uppdf
bib
Proceedings of the Second Workshop on the Bridges and Gaps between Formal and Computational Linguistics (BriGap-2)
Timothée Bernard
|
Timothee Mickus
pdf
bib
abs
Natural Language Inference with CCG Parser and Automated Theorem Prover for DTS
Asa Tomita
|
Mai Matsubara
|
Hinari Daido
|
Daisuke Bekki
We propose a Natural Language Inference (NLI) system based on compositional semantics. The system combines lightblue, a syntactic and semantic parser grounded in Combinatory Categorial Grammar (CCG) and Dependent Type Semantics (DTS), with wani, an automated theorem prover for Dependent Type Theory (DTT). Because each computational step reflects a theoretical assumption, system evaluation serves as a form of hypothesis verification. We evaluate the inference system using the Japanese Semantic Test Suite JSeM, and demonstrate how error analysis provides feedback to improve both the system and the underlying linguistic theory.
pdf
bib
abs
Evaluating The Impact of Stimulus Quality in Investigations of LLM Language Performance
Timothy Pistotti
|
Jason Brown
|
Michael J. Witbrock
Recent studies employing Large Language Models (LLMs) to test the Argument from the Poverty of the Stimulus (APS) have yielded contrasting results across syntactic phenomena. This paper investigates the hypothesis that characteristics of the stimuli used in recent studies, including lexical ambiguities and structural complexities, may confound model performance. A methodology is proposed for re-evaluating LLM competence on syntactic prediction, focusing on GPT-2. This involves: 1) establishing a baseline on previously used (both filtered and unfiltered) stimuli, and 2) generating a new, refined dataset using a state-of-the-art (SOTA) generative LLM (Gemini 2.5 Pro Preview) guided by linguistically-informed templates designed to mitigate identified confounds. Our preliminary findings indicate that GPT-2 demonstrates notably improved performance on these refined PG stimuli compared to baselines, suggesting that stimulus quality significantly influences outcomes in surprisal-based evaluations of LLM syntactic competency.
pdf
bib
abs
Modal Subordination in Dependent Type Semantics
Aoi Iimura
|
Teruyuki Mizuno
|
Daisuke Bekki
In the field of natural language processing, the construction of “linguistic pipelines”, which draw on insights from theoretical linguistics, stands in a complementary relationship to the prevailing paradigm of large language models. The rapid development of these pipelines has been fueled by recent advancements, including the emergence of Dependent Type Semantics (DTS) — a type-theoretic framework for natural language semantics. While DTS has been successfully applied to analyze complex linguistic phenomena such as anaphora and presupposition, its capability to account for modal expressions remains an underexplored area. This study aims to address this gap by proposing a framework that extends DTS with modal types. This extension broadens the scope of linguistic phenomena that DTS can account for, including an analysis of modal subordination, where anaphora interacts with modal expressions.
pdf
bib
abs
Exploring Gaps in the APS: Direct Minimal Pair Analysis in LLM Syntactic Assessments
Timothy Pistotti
|
Jason Brown
|
Michael J. Witbrock
Recent studies probing the Argument from the Poverty of the Stimulus (APS) have applied Large Language Models (LLMs) to test the learnability of complex syntax through surprisal-based metrics. However, divergent conclusions raise questions concerning the insights these metrics offer. While Wilcox et al. (2024) used direct minimal pair comparisons (the “wh-effect”) to demonstrate that models successfully generalise knowledge of filler-gap dependencies, Lan et al. (2024) used a Difference-in-Differences (DiD) metric and found that models largely fail on parasitic gaps (PGs). This paper argues that the direct minimal pair approach offers greater diagnostic transparency. We demonstrate this by generating a full 8-permutation paradigm of refined PG stimuli and evaluating the GPT-2 model used in previous studies with a systematic Wilcox-style wh-effect analysis. Our results show that GPT-2 succeeds across all four tested conditions, indicating robust knowledge of filler-gap licensing principles even in complex PG environments. This finding, which contrasts with the more ambiguous results from DiD-style metrics, suggests that the choice of evaluation metric is critical for assessing an LLM’s syntactic competence.
pdf
bib
abs
Coordination of Theoretical and Computational Linguistics
Adam Przepiórkowski
|
Agnieszka Patejuk
The aim of this paper is to present a case study of a fruitful and, hopefully, inspiring interaction between formal and computational linguistics. A variety of NLP tools and resources have been used in linguistic investigations of the symmetry of coordination, leading to novel theoretical arguments. The converse impact of theoretical results on NLP work has been successful only in some cases.
pdf
bib
abs
An instructive implementation of semantic parsing and reasoning using Lexical Functional Grammar
Mark-Matthias Zymla
|
Kascha Kruschwitz
|
Paul Zodl
This paper presents a computational resource for exploring semantic parsing and reasoning through a strictly formal lens. Inspired by the framework of Lexical Functional Grammar, our system allows for modular exploration of different aspects of semantic parsing. It consists of a hand-coded formal grammar combining syntactic and semantic annotations, producing basic semantic representations. The system provides the option to extend these basic semantics via rewrite rules in a principled fashion to explore more complex reasoning. The result is a layered system enabling an incremental approach to semantic parsing. We illustrate this approach with examples from the FraCaS test suite, demonstrating its overall functionality and viability.
pdf
bib
abs
Modelling Expectation-based and Memory-based Predictors of Human Reading Times with Syntax-guided Attention
Lukas Mielczarek
|
Timothée Bernard
|
Laura Kallmeyer
|
Katharina Spalek
|
Benoit Crabbé
The correlation between reading times and surprisal is well known in psycholinguistics and is easy to observe. There is also a correlation between reading times and structural integration, which is, however, harder to detect (Gibson, 2000). This correlation has been studied using parsing models whose outputs are linked to reading times. In this paper, we study the relevance of memory-based effects in reading times and how to predict them using neural language models. We find that integration costs significantly improve surprisal-based reading time prediction. Inspired by Timkey and Linzen (2023), we design a small-scale autoregressive transformer language model in which attention heads are supervised by dependency relations. We compare this model to a standard variant by checking how well each model’s outputs correlate with human reading times and find that predicted attention scores can be effectively used as proxies for syntactic integration costs to predict self-paced reading times.
pdf
bib
abs
On the relative impact of categorical and semantic information on the induction of self-embedding structures
Antoine Venant
|
Yutaka Suzuki
We investigate the impact of center embedding and selectional restrictions on neural latent tree models’ tendency to induce self-embedding structures. To this aim, we compare their behavior in different controlled artificial environments involving noun phrases modified by relative clauses, with different quantities of available training data. Our results provide evidence that the existence of multiple center self-embeddings is a stronger incentive than selectional restrictions alone, but that the combination of both is the best incentive overall. We also show that different architectures benefit very differently from these incentives.
pdf
bib
abs
Plural Interpretive Biases: A Comparison Between Human Language Processing and Language Models
Jia Ren
Human communication routinely relies on plural predication, and plural sentences are often ambiguous (see, e.g., Scha, 1984; Dalrymple et al., 1998a, to name a few). Building on extensive theoretical and experimental work in linguistics and philosophy, we ask whether large language models (LLMs) exhibit the same interpretive biases that humans show when resolving plural ambiguity. We focus on two lexical factors: (i) the collective bias of certain predicates (e.g., size/shape adjectives) and (ii) the symmetry bias of predicates. To probe these tendencies, we apply two complementary methods to premise–hypothesis pairs: an embedding-based heuristic using OpenAI’s text-embedding-3-large/small (OpenAI, 2024, 2025) with cosine similarity, and supervised NLI models (bart-large-mnli, roberta-large-mnli) (Lewis et al., 2020; Liu et al., 2019; Williams et al., 2018a; Facebook AI, 2024b,a) that yield asymmetric, calibrated entailment probabilities. Results show partial sensitivity to predicate-level distinctions, but neither method reproduces the robust human pattern, where neutral predicates favor entailment and strongly non-symmetric predicates disfavor it. These findings highlight both the potential and the limits of current LLMs: as cognitive models, they fall short of capturing human-like interpretive biases; as engineering systems, their representations of plural semantics remain unstable for tasks requiring precise entailment.
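As a sketch of the second probing method, the following snippet obtains entailment probabilities for a premise-hypothesis pair from an off-the-shelf NLI model; the example sentences are invented and the pre/post-processing is simplified relative to the paper.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "roberta-large-mnli"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

premise = "The boxes are heavy."
hypothesis = "Each box is heavy."  # distributive reading of the plural predicate

# Encode the ordered pair; NLI probabilities are asymmetric in premise/hypothesis.
inputs = tok(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]

print({model.config.id2label[i]: round(p.item(), 3) for i, p in enumerate(probs)})
```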
uppdf
bib
Proceedings of the Second International Workshop on Construction Grammars and NLP
Claire Bonial
|
Melissa Torgbi
|
Leonie Weissweiler
|
Austin Blodgett
|
Katrien Beuls
|
Paul Van Eecke
|
Harish Tayyar Madabushi
pdf
bib
abs
A Computational Construction Grammar Framework for Modelling Signed Languages
Liesbet De Vos
|
Paul Van Eecke
|
Katrien Beuls
Constructional approaches to signed languages are becoming increasingly popular within sign language linguistics. Current approaches, however, focus primarily on theoretical description, while formalization and computational implementation remain largely unexplored. This paper provides an initial step towards addressing this gap by studying and operationalizing the core mechanisms required for representing and processing manual signed forms using computational construction grammar. These include a phonetic representation of individual manual signs and a formal representation of the complex temporal synchronization patterns between them. The implemented mechanisms are integrated into Fluid Construction Grammar and are available as a module within the Babel software library. Through an interactive web demonstration, we illustrate how this module lays the groundwork for future computational exploration of constructions that bidirectionally map between signed forms and their meanings.
pdf
bib
abs
LLMs Learn Constructions That Humans Do Not Know
Jonathan Dunn
|
Mai Mohamed Eida
This paper investigates false positive constructions: grammatical structures which an LLM hallucinates as distinct constructions but which human introspection does not support. Both a behavioural probing task using contextual embeddings and a meta-linguistic probing task using prompts are included, allowing us to distinguish between implicit and explicit linguistic knowledge. Both methods reveal that models do indeed hallucinate constructions. We then simulate hypothesis testing to determine what would have happened if a linguist had falsely hypothesized that these hallucinated constructions do exist. The high accuracy obtained shows that such false hypotheses would have been overwhelmingly confirmed. This suggests that construction probing methods suffer from a confirmation bias and raises the issue of what unknown and incorrect syntactic knowledge these models also possess.
pdf
bib
abs
Modeling Constructional Prototypes with Sentence-BERT
Yuri V. Yerastov
This paper applies Sentence-BERT embeddings to the analysis of three competing constructions in Canadian English: be perfect, predicate adjective and have perfect. Samples are drawn from a Canadian news media database. Constructional exemplars are vectorized and mean-pooled to create constructional centroids, from which top-ranked exemplars and cross-construction similarities are calculated. Clause type distribution and definiteness marking are also examined. The embeddings-based analysis is cross-validated by a traditional quantitative study, and both lines of inquiry converge on the following tendencies: (1) prevalence of embedded – and particularly adverbial – clauses in the be perfect and predicate adjective constructions, (2) prevalence of matrix clauses in the have perfect, (3) prevalence of definiteness marking in the direct object of the be perfect, and (4) greater statistical similarities between be perfects and predicate adjectives. These findings support the argument that be perfects function as topic-marking constructions within a usage-based framework.
pdf
bib
abs
Construction-Grammar Informed Parameter Efficient Fine-Tuning for Language Models
Prasanth
Large language models excel at statistical pattern recognition but may lack explicit understanding of constructional form-meaning correspondences that characterize human grammatical competence. This paper presents Construction-Aware LoRA (CA-LoRA), a parameter-efficient fine-tuning method that incorporates constructional templates through specialized loss functions and targeted parameter updates. We focus on five major English construction types: ditransitive, caused-motion, resultative, way-construction, and conative. Evaluation on BLiMP, CoLA, and SyntaxGym shows selective improvements: frequent patterns like ditransitive and caused-motion show improvements of approximately 3.5 percentage points, while semi-productive constructions show minimal benefits (1.2 points). Overall performance improves by 1.8% on BLiMP and 1.6% on SyntaxGym, while maintaining competitive performance on general NLP tasks. Our approach requires only 1.72% of trainable parameters and reduces training time by 67% compared to full fine-tuning. This work demonstrates that explicit constructional knowledge can be selectively integrated into neural language models, with effectiveness dependent on construction frequency and structural regularity.
pdf
bib
abs
ASC analyzer: A Python package for measuring argument structure construction usage in English texts
Hakyung Sung
|
Kristopher Kyle
Argument structure constructions (ASCs) offer a theoretically grounded lens for analyzing second language (L2) proficiency, yet scalable and systematic tools for measuring their usage remain limited. This paper introduces the ASC analyzer, a publicly available Python package designed to address this gap. The analyzer automatically tags ASCs and computes 50 indices that capture diversity, proportion, frequency, and ASC-verb lemma association strength. To demonstrate its utility, we conduct both bivariate and multivariate analyses that examine the relationship between ASC-based indices and L2 writing scores.
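To illustrate what an ASC-verb lemma association index can look like, here is a toy pointwise mutual information computation over hypothetical (construction, verb) counts; the analyzer's actual indices and formulas may differ.

```python
import math
from collections import Counter

# Hypothetical (ASC label, verb lemma) observations from a learner corpus.
observations = [("ditransitive", "give"), ("ditransitive", "send"),
                ("transitive", "give"), ("transitive", "see"),
                ("ditransitive", "give")]

pair_counts = Counter(observations)
asc_counts = Counter(asc for asc, _ in observations)
verb_counts = Counter(verb for _, verb in observations)
n = len(observations)

def pmi(asc: str, verb: str) -> float:
    """Pointwise mutual information between a construction and a verb lemma."""
    p_joint = pair_counts[(asc, verb)] / n
    return math.log2(p_joint / ((asc_counts[asc] / n) * (verb_counts[verb] / n)))

print(round(pmi("ditransitive", "give"), 3))  # positive value = attraction
```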
pdf
bib
abs
Verbal Predication Constructions in Universal Dependencies
William Croft
|
Joakim Nivre
Is the framework of Universal Dependencies (UD) compatible with findings from linguistic typology about constructions in the world’s languages? To address this question, we need to systematically review how UD represents these constructions, and how it handles the range of morphosyntactic variation attested across languages. In this paper, we present the results of such a review focusing on verbal predication constructions. We find that, although UD can represent all major constructions in this area, the guidelines are not completely coherent with respect to the criteria for core argument relations and not completely systematic in the definition of subtypes for nonbasic voice constructions. To improve the overall coherence of the guidelines, we propose a number of revisions for future versions of UD.
pdf
bib
abs
Linguistic Generalizations are not Rules: Impacts on Evaluation of LMs
Leonie Weissweiler
|
Kyle Mahowald
|
Adele E. Goldberg
Linguistic evaluations of how well LMs generalize to produce or understand novel text often implicitly take for granted that natural languages are generated by symbolic rules. Grammaticality is thought to be determined by whether sentences obey such rules. Interpretation is believed to be compositionally generated by syntactic rules operating on meaningful words. Semantic parsing is intended to map sentences into formal logic. Failures of LMs to obey strict rules have been taken to reveal that LMs do not produce or understand language like humans. Here we suggest that LMs’ failures to obey symbolic rules may be a feature rather than a bug, because natural languages are not based on rules. New utterances are produced and understood by a combination of flexible, interrelated, and context-dependent constructions. We encourage researchers to reimagine appropriate benchmarks and analyses that acknowledge the rich, flexible generalizations that comprise natural languages.
pdf
bib
abs
You Shall Know a Construction by the Company it Keeps: Computational Construction Grammar with Embeddings
Lara Verheyen
|
Jonas Doumen
|
Paul Van Eecke
|
Katrien Beuls
Linguistic theories and models of natural language can be divided into two categories, depending on whether they represent and process linguistic information numerically or symbolically. Numerical representations, such as the embeddings that are at the core of today’s large language models, have the advantage of being learnable from textual data, and of being robust and highly scalable. Symbolic representations, like the ones that are commonly used to formalise construction grammar theories, have the advantage of being compositional and interpretable, and of supporting sound logic reasoning. While both approaches build on very different mathematical frameworks, there is no reason to believe that they are incompatible. In the present paper, we explore how numerical, in casu distributional, representations of linguistic forms, constructional slots and grammatical categories can be integrated in a computational construction grammar framework, with the goal of reaping the benefits of both symbolic and numerical methods.
pdf
bib
abs
Constructions All the Way Up: From Sensory Experiences to Construction Grammars
Jérôme Botoko Ekila
|
Lara Verheyen
|
Katrien Beuls
|
Paul Van Eecke
Constructionist approaches to language posit that all linguistic knowledge is captured in constructions. These constructions pair form and meaning at varying levels of abstraction, ranging from purely substantive to fully abstract and are all acquired through situated communicative interactions. In this paper we provide computational support for these foundational principles. We present a model that enables an artificial learner agent to acquire a construction grammar directly from its sensory experience. The grammar is built from the ground up, i.e. without a given lexicon, predefined categories or ontology and covers a range of constructions, spanning from purely substantive to partially schematic. Our approach integrates two previously separate but related experiments, allowing the learner to incrementally build a linguistic inventory that solves a question-answering task in a synthetic environment. These findings demonstrate that linguistic knowledge at different levels can be mechanistically acquired from experience.
pdf
bib
abs
Performance and competence intertwined: A computational model of the Null Subject stage in English-speaking children
Soumik Dey
|
William Sakas
The empirically established null subject (NS) stage, lasting until about 4 years of age, involves frequent omission of subjects by children. Orfitelli and Hyams (2012) observe that young English speakers often confuse imperative NS utterances with declarative ones due to performance influences, promoting a temporary null subject grammar. We propose a new computational parameter to measure this misinterpretation and incorporate it into a simulated model of obligatory subject grammar learning. Using a modified version of the Variational Learner (Yang, 2012) which works for superset-subset languages, our simulations support Orfitelli and Hyams’ hypothesis. More generally, this study outlines a framework for integrating computational models in the study of grammatical acquisition alongside other key developmental factors.
pdf
bib
abs
A is for a-generics: Predicate Collectivity in Generic Constructions
Carlotta Marianna Cascino
Generic statements like *A dog has four legs* are central to encode general knowledge. Yet their form–meaning mapping remains elusive. Some predicates sound natural with indefinite singulars (*a*-generics), while others require the definite article (*the*-generics) or the bare plural (bare-plural generics). For instance, why do we say *The computer revolutionized education* but not *A computer revolutionized education*? We propose a construction-based account explaining why not all generic statements are created equal. Prior accounts invoke semantic notions like kind-reference, stage-levelness, or accidental generalization, but offer no unified explanation. This paper introduces a new explanatory dimension: predicate collectivity level, i.e. whether the predicate applies to each member of a group or to the whole group as a unit (without necessarily applying to each of its members individually). Using two preregistered acceptability experiments we show that *a*-generics, unlike *the*-generics and bare-plural generics, are dispreferred with collective predicates. The findings offer a functionally motivated, empirically supported account of morphosyntactic variation in genericity, providing a new entry point for Construction Grammar.
pdf
bib
abs
Rethinking Linguistic Structures as Dynamic Tensegrities
Remi van Trijp
Constructional approaches to language have evolved from rigid tree-based representations to framing constructions as flexible, multidimensional pairings of form and function. However, it remains unclear how to formalize this conceptual shift in ways that are both computationally scalable and scientifically insightful. This paper proposes dynamic tensegrity – a term derived from “tensile integrity” – as a novel architecture metaphor for modelling linguistic form. It argues that linguistic structure emerges from dynamically evolving networks of constraint-based tensions rather than fixed hierarchies. The paper explores the theoretical consequences of this view, supplemented with a proof-of-concept implementation in Fluid Construction Grammar, demonstrating how a tensegrity-inspired approach can support robustness and adaptivity in language processing.
pdf
bib
abs
Psycholinguistically motivated Construction-based Tree Adjoining Grammar
Shingo Hattori
|
Laura Kallmeyer
|
Rainer Osswald
This paper proposes a formal framework based on Tree Adjoining Grammar (TAG) that aims to incorporate central tenets of Construction Grammar while integrating mechanisms from a psycholinguistically motivated variant of TAG. Central ideas are (i) to give TAG-inspired tree representation to various constructions including schematic constructions like argument structure constructions, (ii) to link schematic constructions that are extensions of each other within a network of constructions, (iii) to make the derivation proceed incrementally, (iv) to allow the prediction of upcoming constructions during derivation and (v) to introduce the incremental extension of schematic constructions to larger ones via extension trees in a usage-based manner. The final point is the major novel contribution, which can be conceptualized as the on-the-fly traversal of the inheritance links in the network of constructions. Moreover, we present first experiments towards a parser implementation. We report preliminary results of extracting constructions from the Penn Treebank and automatically identifying constructions to be added during incremental parsing, based on a generative language model (GPT-2).
pdf
bib
abs
Assessing Minimal Pairs of Chinese Verb-Resultative Complement Constructions: Insights from Language Models
Xinyao Huang
|
Yue Pan
|
Stefan Hartmann
|
Yang Yanning
Chinese verb-resultative complement constructions (VRCCs) constitute a distinctive syntactic-semantic pattern in Chinese that integrates agent-patient dynamics with real-world state changes; yet widely used benchmarks such as CLiMP and ZhoBLiMP provide few minimal-pair probes tailored to these constructions. We introduce ZhVrcMP, a 1,204-pair dataset spanning two paradigms: resultative complement presence versus absence, and verb–complement order. The examples are drawn from Modern Chinese and are annotated for linguistic validity. Using mean log probability scoring, we evaluate Zh-Pythia models (14M–1.4B) and Mistral-7B-Instruct-v0.3. Larger Zh-Pythia models perform strongly, especially on the order paradigm, reaching 89.87% accuracy. Mistral-7B-Instruct-v0.3 shows lower perplexity yet overall weaker accuracy, underscoring the remaining difficulty of modeling constructional semantics in Chinese.
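The mean log probability scoring mentioned above follows the standard minimal-pair evaluation recipe: score each member of a pair by its average per-token log-probability and check whether the acceptable variant wins. The sketch below is a minimal illustration of that recipe using Hugging Face Transformers; the helper names are ours, and the model identifier is only one of the checkpoints mentioned in the abstract.

```python
# Minimal sketch of mean log-probability scoring for minimal pairs.
# Model name and helper names are illustrative, not taken from the paper.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "mistralai/Mistral-7B-Instruct-v0.3"  # or a Zh-Pythia checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def mean_log_prob(sentence: str) -> float:
    """Average per-token log-probability of a sentence under the model."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing the inputs as labels yields the mean cross-entropy over tokens.
        loss = model(ids, labels=ids).loss
    return -loss.item()

def prefers_acceptable(good: str, bad: str) -> bool:
    """True if the model scores the acceptable variant above the unacceptable one."""
    return mean_log_prob(good) > mean_log_prob(bad)
```

Accuracy on a paradigm is then simply the fraction of pairs for which `prefers_acceptable` returns True.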
pdf
bib
abs
Meaning-infused grammar: Gradient Acceptability Shapes the Geometric Representations of Constructions in LLMs
Supantho Rakshit
|
Adele E. Goldberg
The usage-based constructionist (UCx) approach to language posits that language comprises a network of learned form-meaning pairings (constructions) whose use is largely determined by their meanings or functions, requiring them to be graded and probabilistic. This study investigates whether the internal representations in Large Language Models (LLMs) reflect the proposed function-infused gradience. We analyze representations of the English Double Object (DO) and Prepositional Object (PO) constructions in Pythia-1.4B, using a dataset of 5,000 sentence pairs systematically varied by human-rated preference strength for DO or PO. Geometric analyses show that the separability between the two constructions’ representations, as measured by energy distance or Jensen-Shannon divergence, is systematically modulated by gradient preference strength, which depends on lexical and functional properties of sentences. That is, more prototypical exemplars of each construction occupy more distinct regions in activation space than sentences that could equally well have occurred in either construction. These results provide evidence that LLMs learn rich, meaning-infused, graded representations of constructions and offer support for the use of geometric measures of representations in LLMs.
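Energy distance, one of the separability measures named above, can be estimated directly from two samples of hidden-state vectors. The following is a hedged sketch of such an estimate; the array names and the strong/weak split in the usage comment are illustrative assumptions, not details from the paper.

```python
# Sketch: sample estimate of the energy distance between two sets of activations.
import numpy as np
from scipy.spatial.distance import cdist

def energy_distance(X: np.ndarray, Y: np.ndarray) -> float:
    """Energy distance between two samples of hidden-state vectors.

    X: (n, d) activations for DO sentences, Y: (m, d) activations for PO sentences.
    Uses the plug-in estimate D^2 = 2*E||x-y|| - E||x-x'|| - E||y-y'||.
    """
    d_xy = cdist(X, Y).mean()  # mean cross-sample distance
    d_xx = cdist(X, X).mean()  # mean within-X distance
    d_yy = cdist(Y, Y).mean()  # mean within-Y distance
    return float(np.sqrt(max(2.0 * d_xy - d_xx - d_yy, 0.0)))

# Hypothetical usage: compare separability for strongly vs. weakly biased pairs.
# strong_sep = energy_distance(acts_do_strong, acts_po_strong)
# weak_sep   = energy_distance(acts_do_weak,  acts_po_weak)
```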
pdf
bib
abs
Annotating English Verb-Argument Structure via Usage-Based Analogy
Allen Minchun Hsiao
|
Laura A. Michaelis
This paper introduces a usage-based framework that models argument structure annotation as nearest-neighbor classification over verb–argument structure (VAS) embeddings. Instead of parsing sentences separately, the model aligns new tokens with previously observed constructions in an embedding space derived from semi-automatic corpus annotations. Pilot studies show that cosine similarity captures both form and meaning, that nearest-neighbor classification generalizes to dative alternation verbs, and that accuracy in locative alternation depends on the corpus source of exemplars. These results suggest that analogical classification is shaped by both structural similarity and corpus alignment, highlighting key considerations for scalable, construction-based annotation of new sentence inputs.
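As a rough illustration of nearest-neighbor classification over VAS embeddings, the sketch below assigns a new token the majority construction label of its most cosine-similar exemplars. The embedding function, label inventory, and choice of k are placeholders rather than details from the paper.

```python
# Sketch: k-nearest-neighbor construction labelling over embedding vectors.
import numpy as np

def cosine_sim(query: np.ndarray, bank: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and a bank of exemplar vectors."""
    q = query / np.linalg.norm(query)
    B = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    return B @ q

def nearest_construction(query_vec, exemplar_vecs, exemplar_labels, k=5):
    """Assign the majority construction label among the k most similar exemplars."""
    sims = cosine_sim(query_vec, exemplar_vecs)
    top_k = np.argsort(-sims)[:k]
    labels = [exemplar_labels[i] for i in top_k]
    return max(set(labels), key=labels.count)

# Hypothetical usage: label a new token of "gave" as ditransitive vs. to-dative.
# label = nearest_construction(embed(sentence), bank_vecs, bank_labels)
```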
pdf
bib
abs
Can Constructions “SCAN” Compositionality?
Ganesh Katrapati
|
Manish Shrivastava
Sequence-to-sequence models struggle with compositionality and systematic generalisation even while they excel at many other tasks. We attribute this limitation to their failure to internalise constructions—conventionalised form–meaning pairings that license productive recombination. Building on these insights, we introduce an unsupervised procedure for mining pseudo-constructions: variable-slot templates automatically extracted from training data. When applied to the SCAN dataset, our method yields large gains on out-of-distribution splits: accuracy rises to 47.8% on ADD JUMP and to 20.3% on AROUND RIGHT without any architectural changes or additional supervision. The model also attains competitive performance with ≤ 40% of the original training data, demonstrating strong data efficiency. Our findings highlight the promise of construction-aware preprocessing as an alternative to heavy architectural or training-regime interventions.
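As one way to picture what mining variable-slot templates from SCAN-style commands might look like, the toy sketch below abstracts a single position into a slot whenever several distinct fillers attest the same surrounding context; the authors' actual procedure may differ substantially.

```python
# Toy illustration of mining variable-slot templates ("pseudo-constructions")
# from SCAN-style command strings. The mining heuristic here is an assumption.
from collections import defaultdict

def mine_templates(commands, min_fillers=2):
    """Abstract one position at a time into a slot when several fillers attest it."""
    slots = defaultdict(set)
    for cmd in commands:
        toks = cmd.split()
        for i, tok in enumerate(toks):
            template = tuple(toks[:i] + ["<X>"] + toks[i + 1:])
            slots[template].add(tok)
    return {t: fillers for t, fillers in slots.items() if len(fillers) >= min_fillers}

# Example: "jump around right" and "walk around right" both license the template
# ("<X>", "around", "right") with fillers {"jump", "walk"}.
templates = mine_templates(["jump around right", "walk around right", "jump twice"])
```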
pdf
bib
abs
From Form to Function: A Constructional NLI Benchmark
Claire Bonial
|
Taylor Pellegrin
|
Melissa Torgbi
|
Harish Tayyar Madabushi
We present CoGS-NLI, a Natural Language Inference (NLI) evaluation benchmark testing understanding of English phrasal constructions drawn from the Construction Grammar Schematicity (CoGS) corpus. This dataset of 1,500 NLI triples facilitates assessment of constructional understanding in a downstream inference task. We establish baseline results with two language models, varying the number and kinds of examples given in the prompt, with and without chain-of-thought prompting. The best-performing model and prompt combination achieves a strong overall accuracy of .94 when provided with in-context learning examples containing the target phrasal constructions, whereas providing additional general NLI examples hurts performance. This evidences the value of resources explicitly capturing the semantics of phrasal constructions, while our qualitative analysis suggests caveats in assuming this performance indicates a deep understanding of constructional semantics.
pdf
bib
abs
Evaluating CxG Generalisation in LLMs via Construction-Based NLI Fine Tuning
Tom Mackintosh
|
Harish Tayyar Madabushi
|
Claire Bonial
We probe large language models’ ability to learn deep form-meaning mappings as defined by construction grammars. We introduce the ConTest-NLI benchmark of 80k sentences covering eight English constructions, ranging from highly lexicalized to highly schematic. Our pipeline generates diverse synthetic NLI triples via templating and the application of a model-in-the-loop filter, providing aspects of human validation to ensure challenge and label reliability. Zero-shot tests on leading LLMs reveal a 24% drop in accuracy between naturalistic (88%) and adversarial data (64%), with schematic patterns proving hardest. Fine-tuning on a subset of ConTest-NLI yields up to a 9% improvement, yet our results highlight persistent abstraction gaps in current LLMs and offer a scalable framework for evaluating construction-informed learning.
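The templating step described above can be pictured as slot-filling over a small set of premise and hypothesis frames. The sketch below is purely illustrative: the construction, templates, fillers, and labels are assumptions, not the ConTest-NLI generation code.

```python
# Toy sketch of templated NLI-triple generation for a caused-motion-like pattern.
# All frames and fillers are hypothetical placeholders.
import random

PREMISE = "{agent} {verb}ed the {theme} {path}."
HYPOTHESES = {
    "entailment":    "The {theme} moved {path}.",
    "contradiction": "The {theme} did not move.",
    "neutral":       "The {theme} was expensive.",
}

def make_triples(agents, verbs, themes, paths, n=100, seed=0):
    """Generate (premise, hypothesis, label) triples by sampling slot fillers."""
    rng = random.Random(seed)
    triples = []
    for _ in range(n):
        slots = {"agent": rng.choice(agents), "verb": rng.choice(verbs),
                 "theme": rng.choice(themes), "path": rng.choice(paths)}
        premise = PREMISE.format(**slots)
        for label, frame in HYPOTHESES.items():
            triples.append((premise, frame.format(**slots), label))
    return triples

# Example (regular verbs only, to keep the toy morphology correct):
# make_triples(["Kim"], ["push"], ["cart"], ["into the room"], n=1)
```

A model-in-the-loop filter would then score or discard generated triples, with human checks on a sample to confirm difficulty and label quality.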
pdf
bib
abs
Construction Grammar Evidence for How LLMs Use Context-Directed Extrapolation to Solve Tasks
Harish Tayyar Madabushi
|
Claire Bonial
In this paper, we apply the lens of Construction Grammar to provide linguistically-grounded evidence for the recently introduced view of LLMs that moves beyond the “stochastic parrot” and “emergent Artificial General Intelligence” extremes. We provide further evidence, this time rooted in linguistic theory, that the capabilities of LLMs are best explained by a process of context-directed extrapolation from their training priors. This mechanism, guided by in-context examples in base models or the prompt in instruction-tuned models, clarifies how LLM performance can exceed stochastic parroting without achieving the scalable, general-purpose reasoning seen in humans. Construction Grammar is uniquely suited to this investigation, as it provides a precise framework for testing the boundary between true generalization and sophisticated pattern-matching on novel linguistic tasks. The ramifications of this framework explaining LLM performance are three-fold: first, there is explanatory power providing insights into seemingly idiosyncratic LLM weaknesses and strengths; second, there are empowering methods for LLM users to improve performance of smaller models in post-training; third, there is a need to shift LLM evaluation paradigms so that LLMs are assessed relative to the prevalence of relevant priors in training data, and Construction Grammar provides a framework to create such evaluation data.
pdf
bib
abs
A Computational CxG Aided search for ‘come to’ constructions in a corpus of African American Novels from 1920 to 1930
Kamal Abou Mikhael
This paper presents a pilot study of metaphors of motion in African American literary language (AALL) in two sub-corpora of novels published in 1920-1925 and 1926-1930. It assesses the effectiveness of Dunn’s (2024) unsupervised learning approach to computational construction grammar (c2xg) as a basis for searching for constructional metaphors, a purpose beyond its original design as a grammar-learning tool. This method is chosen for its statistical orientation and employed without pre-trained models to minimize bias towards standard language; its output is also used to choose a target search term. Focusing on the verbal phrase ‘come to’, the study analyzes argument-structure constructions that instantiate conceptual metaphors, most prominently experiencer-as-theme (e.g., ‘he came to know’) and experiencer-as-goal (e.g., ‘thoughts came to her’). The evaluation compares c2xg coverage against a manually annotated set of metaphors and examines the uniformity of metaphor types extracted. Results show that c2xg captures 52% and 63% of metaphoric constructions in the two sub-corpora, with variation in coverage and uniformity depending on the ambiguity of the construct. The study underscores the value of combining computational and manual analysis to obtain outcomes that are both informative and ethically aware when studying marginalized varieties of English.
uppdf
bib
Proceedings of the 21st Joint ACL - ISO Workshop on Interoperable Semantic Annotation (ISA-21)
Harry Bunt
pdf
bib
abs
Engagement and Non-Engagement: Two Notions at the Core of an Annotation Schema of Enunciative Strategies
Cyril Bruneau
|
Delphine Battistelli
This study presents an annotation schema covering a wide range of enunciative strategies underlying every enunciation process by which an enunciator actualizes a predicative content. We show that most of these enunciative strategies involve the enunciator in a relationship of Engagement (concerned with the notions of truth value and axiological/appreciative value) or Non-Engagement toward a stated predicative content. Our approach is situated within the French enunciative framework rooted in the work of Bally (1932). We explicitly compare our approach with that of Appraisal theory (Martin and White, 2003). We also illustrate the applications of our schema with a manual annotation experiment conducted on a corpus of French history textbooks. This experiment reveals interesting diachronic variations in the enunciator’s modes of Engagement and Non-Engagement.
pdf
bib
abs
Revisiting the ISO-TimeML abstract syntax
Harry Bunt
|
Alex Fang
|
Kiyong Lee
|
Volha Petukhova
|
Purificação Silvano
|
James Pustejovsky
This paper describes some of the ongoing work within the ISO preliminary work item PWI 254617-17, ‘Interlinking of annotations’. This PWI investigates the possibilities and problems of combining annotations made with different annotation schemes, using the ‘interlinking’ approach (Bunt, 2024) applied to different parts of the multi-part standard ISO 24617, ‘Semantic annotation framework’. This paper focuses on the combination of ISO-TimeML and QuantML at the level of abstract syntax. As a basis for this combination, a new version of the ISO-TimeML abstract syntax specification is defined, together with its relation to the concrete (XML-based) syntax. As a side effect, some issues in the use of ISO-TimeML come to light that could be relevant for a possible future second edition of this standard.
pdf
bib
abs
The representation of QuantML annotations in UMR - an exploration
Harry Bunt
|
Kiyong Lee
This paper explores the possibilities and the problems in using Unified Meaning Representations (UMRs) for representing annotations of quantification phenomena according to the ISO standard scheme QuantML (ISO 24617-12:2025). We show that the semantic information in QuantML annotations can be expressed in UMR, provided that some powerful semantic concepts are introduced and a slightly more general approach is adopted for the representation of multiple scope relations. Conversion functions are defined that transform the XML-based representations of QuantML into UMR structures and vice versa. We discuss the consequences that can be drawn from this for the possible role of UMR and for the semantics of UMR representations of quantification.
pdf
bib
abs
Cococorpus: a corpus of copredication
Long Chen
|
Deniz Ekin Yavaş
|
Laura Kallmeyer
|
Rainer Osswald
While copredication has been widely investigated as a linguistic phenomenon, there is a notable lack of systematically annotated data to support empirical and quantitative research. This paper gives an overview of the ongoing construction of Cococorpus, a corpus of copredication, describes the annotation methodology and guidelines, and presents preliminary findings from the annotated data. Currently, the corpus contains 1,500 gold-standard manual annotations, including about 200 sentences with copredications. The annotated data not only supports the empirical validation of existing theories of copredication, but also reveals regularities that may inform theoretical development.
pdf
bib
abs
Can ISO 24617-1 go clinical? Extending a General-Domain Scheme to Medical Narratives
Ana Luísa Fernandes
|
Purificação Silvano
|
António Leal
|
Nuno Guimarães
|
Evelin Amorim
The definition of rigorous and well-structured annotation schemes is a key element in the advancement of Natural Language Processing (NLP). This paper aims to compare the performance of a general-purpose annotation scheme — Text2Story, based on the ISO 24617-1 standard — with that of a domain-specific scheme — i2b2 — in the context of clinical narrative annotation; and to assess the feasibility of harmonizing ISO 24617-1, originally designed for general-domain applications, with a specialized extension tailored to the medical domain. Based on the results of this comparative analysis, we present Med2Story, a medical-specific extension of ISO 24617-1 developed to address the particularities of clinical text annotation.
pdf
bib
abs
Enhancing ISO 24617-2: Formalizing Apology and Thanking Acts for Spoken Russian Dialogue Annotation
Ksenia Klokova
|
Anton Bankov
|
Nikolay Ignatiev
This paper refines ISO 24617-2’s Social Obligations Management dimension by formalizing apology and thanking acts for Russian dialogue annotation. Addressing gaps in formal definitions and limited response strategies, we propose culture-neutral semantic cores using Wierzbicka’s universal primes and update semantics. We introduce three response functions: address (minimal acknowledgment), downplay (mitigation), and decline (reinforcement). Validated through qualitative analysis, this framework captures empirical strategies—including non-response, formulaic minimization, and strategic obligation maintenance—unaddressed in the current standard. Our approach maintains ISO compatibility while eliminating unsubstantiated elements like obligatory response pressure, enhancing annotation accuracy for Russian dialogue.
pdf
bib
abs
An annotation scheme for financial news in Portuguese
António Leal
|
Purificação Silvano
|
Zuo Qinren
|
Evelin Amorim
|
Alípio Jorge
We present an annotation scheme designed to capture information related to the maintenance or change in the price of certain goods (fuels, water, and vehicles) in news articles in Portuguese. Our methodology involved adapting an existing annotation scheme, the Text2Story scheme (Silvano et al., 2021; Leal et al., 2022), which is based on different parts of ISO 24617, to capture the essential information for this project. Adaptations were needed to accommodate specific information, namely information related to quantitative data and comparative relations, which is abundant in this type of news. In this paper, we provide an overview of the annotation scheme, highlighting attributes and values of the entity and link structures specifically designed to capture financial information, as well as some problems we had to overcome in the process of building it and the rationale behind some of the decisions that shaped its overall architecture.
pdf
bib
abs
Enhanced Evaluative Language Annotation through Refined Theoretical Framework and Workflow
Jiamei Zeng
|
Haitao Wang
|
Harry Bunt
|
Xinyu Cao
|
Sylviane Cardey
|
Min Dong
|
Tianyong Hao
|
Yangli Jia
|
Kiyong Lee
|
Shengqing Liao
|
James Pustejovsky
|
François Claude Rey
|
Laurent Romary
|
Jianfang Zong
|
Alex C. Fang
As precursor work in preparation for an international standard ISO/PWI 24617-16 Language resource management – Semantic annotation – Part 16: Evaluative language, we aim to test and enhance the reliability of the annotation of subjective evaluation based on Appraisal Theory. We describe a comprehensive three-phase workflow, tested on COVID-19 media reports, designed to achieve reliable agreement through progressive training and quality control. Our methodology addresses some of the key challenges through targeted guideline refinements and the development of interactive clarification tools, alongside a custom platform that enables the pre-classification of six evaluative categories, systematic annotation review, and organized documentation. We report empirical results that demonstrate substantial improvements from the initial moderate agreement to a strong final consensus. Our research offers both theoretical refinements addressing persistent classification challenges in evaluation and practical solutions for the implementation of the annotation workflow, proposing a replicable methodology for achieving reliable consistency in the annotation of evaluative language.
pdf
bib
abs
Multimodal Common Ground Annotation for Partial Information Collaborative Problem Solving
Yifan Zhu
|
Changsoo Jung
|
Kenneth Lai
|
Videep Venkatesha
|
Mariah Bradford
|
Jack Fitzgerald
|
Huma Jamil
|
Carine Graff
|
Sai Kiran Ganesh Kumar
|
Bruce Draper
|
Nathaniel Blanchard
|
James Pustejovsky
|
Nikhil Krishnaswamy
This project note describes the challenges and procedures involved in annotating an audiovisual dataset capturing a multimodal situated collaborative construction task. In the task, all participants begin with different partial information and must collaborate using speech, gesture, and action to arrive at a solution that satisfies all individual pieces of private information. This rich data poses a number of annotation challenges, from small objects in a close space to the implicit and multimodal fashion in which participants express agreement, disagreement, and beliefs. We discuss the data collection procedure, annotation schemas and tools, and future use cases.