Proceedings of the Second Workshop on the Bridges and Gaps between Formal and Computational Linguistics (BriGap-2)
Timothée Bernard | Timothee Mickus
Natural Language Inference with CCG Parser and Automated Theorem Prover for DTS
Asa Tomita | Mai Matsubara | Hinari Daido | Daisuke Bekki
We propose a Natural Language Inference (NLI) system based on compositional semantics. The system combines lightblue, a syntactic and semantic parser grounded in Combinatory Categorial Grammar (CCG) and Dependent Type Semantics (DTS), with wani, an automated theorem prover for Dependent Type Theory (DTT). Because each computational step reflects a theoretical assumption, system evaluation serves as a form of hypothesis verification. We evaluate the inference system using the Japanese Semantic Test Suite JSeM, and demonstrate how error analysis provides feedback to improve both the system and the underlying linguistic theory.
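To make the parse-then-prove architecture concrete, here is a minimal sketch of the pipeline the abstract describes; parse_to_dts and prove are hypothetical stand-ins, not the actual interfaces of lightblue or wani.

```python
# Conceptual sketch only: parse_to_dts() and prove() are hypothetical stand-ins
# for the lightblue parser and the wani prover, not their real APIs.

def nli_verdict(premises, hypothesis, parse_to_dts, prove):
    """Return 'yes', 'no', or 'unknown' for a JSeM-style inference problem."""
    # 1. Parse each sentence into a DTS semantic representation (a DTT type).
    premise_types = [parse_to_dts(p) for p in premises]
    hypothesis_type = parse_to_dts(hypothesis)

    # 2. Entailment holds iff the hypothesis type is inhabited given the premises;
    #    contradiction iff its negation is.
    if prove(context=premise_types, goal=hypothesis_type):
        return "yes"
    if prove(context=premise_types, goal=("not", hypothesis_type)):
        return "no"
    return "unknown"
```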
Evaluating The Impact of Stimulus Quality in Investigations of LLM Language Performance
Timothy Pistotti | Jason Brown | Michael J. Witbrock
Recent studies employing Large Language Models (LLMs) to test the Argument from the Poverty of the Stimulus (APS) have yielded contrasting results across syntactic phenomena. This paper investigates the hypothesis that characteristics of the stimuli used in recent studies, including lexical ambiguities and structural complexities, may confound model performance. A methodology is proposed for re-evaluating LLM competence on syntactic prediction, focusing on GPT-2. This involves: 1) establishing a baseline on previously used (both filtered and unfiltered) stimuli, and 2) generating a new, refined dataset using a state-of-the-art (SOTA) generative LLM (Gemini 2.5 Pro Preview) guided by linguistically informed templates designed to mitigate the identified confounds. Our preliminary findings indicate that GPT-2 demonstrates notably improved performance on these refined parasitic gap (PG) stimuli compared to the baselines, suggesting that stimulus quality significantly influences outcomes in surprisal-based evaluations of LLM syntactic competency.
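For readers unfamiliar with surprisal-based evaluation, the following is a minimal sketch of how per-token surprisal can be extracted from GPT-2 with the Hugging Face transformers library; the example is illustrative and does not reproduce the paper's stimuli.

```python
# Sketch of per-token surprisal extraction from GPT-2 (illustrative sentences only).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_surprisals(sentence):
    """Return per-token surprisal (in bits) for every token after the first."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    # Surprisal of token t is -log2 P(token_t | tokens_<t).
    token_log_probs = log_probs[0, :-1].gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
    surprisal = -token_log_probs / torch.log(torch.tensor(2.0))
    tokens = tokenizer.convert_ids_to_tokens(ids[0, 1:])
    return list(zip(tokens, surprisal.tolist()))
```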
Modal Subordination in Dependent Type Semantics
Aoi Iimura | Teruyuki Mizuno | Daisuke Bekki
In the field of natural language processing, the construction of “linguistic pipelines”, which draw on insights from theoretical linguistics, stands in a complementary relationship to the prevailing paradigm of large language models. The rapid development of these pipelines has been fueled by recent advancements, including the emergence of Dependent Type Semantics (DTS) — a type-theoretic framework for natural language semantics. While DTS has been successfully applied to analyze complex linguistic phenomena such as anaphora and presupposition, its capability to account for modal expressions remains an underexplored area. This study aims to address this gap by proposing a framework that extends DTS with modal types. This extension broadens the scope of linguistic phenomena that DTS can account for, including an analysis of modal subordination, where anaphora interacts with modal expressions.
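As a rough illustration of the kind of phenomenon at stake (not the paper's actual proposal), the classic modal subordination discourse can be pictured in DTS-style notation, with a Σ-type for the indefinite and a possibility operator for the modal context:

```latex
% Illustrative notation only: Sigma-types for the indefinite, \Diamond for the
% modal context; the actual modal types and their rules are the paper's contribution.
\[
  \text{``A wolf might walk in.''}\;\leadsto\;
  \Diamond\bigl(\Sigma\, x{:}\mathbf{entity}.\ \mathbf{wolf}(x)\times\mathbf{walk\_in}(x)\bigr)
\]
\[
  \text{``It would eat you.''}\;\leadsto\;
  \Diamond\bigl(\mathbf{eat}(\pi_1(u),\,\mathbf{hearer})\bigr),
  \quad\text{where $u$ is an anaphoric pointer to the indefinite introduced under $\Diamond$.}
\]
```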
Exploring Gaps in the APS: Direct Minimal Pair Analysis in LLM Syntactic Assessments
Timothy Pistotti | Jason Brown | Michael J. Witbrock
Recent studies probing the Argument from the Poverty of the Stimulus (APS) have applied Large Language Models (LLMs) to test the learnability of complex syntax through surprisal-based metrics. However, divergent conclusions raise questions concerning the insights these metrics offer. While Wilcox et al. (2024) used direct minimal pair comparisons (the “wh-effect”) to demonstrate that models successfully generalise knowledge of filler-gap dependencies, Lan et al. (2024) used a Difference-in-Differences (DiD) metric and found that models largely fail on parasitic gaps (PGs). This paper argues that the direct minimal pair approach offers greater diagnostic transparency. We demonstrate this by generating a full 8-permutation paradigm of refined PG stimuli and evaluating the GPT-2 model used in previous studies with a systematic Wilcox-style wh-effect analysis. Our results show that GPT-2 succeeds across all four tested conditions, indicating robust knowledge of filler-gap licensing principles even in complex PG environments. This finding, which contrasts with the more ambiguous results from DiD-style metrics, suggests that the choice of evaluation metric is critical for assessing an LLM’s syntactic competence.
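The minimal pair logic can be summarised in a few lines: given the four cells of a filler-by-gap paradigm, a Wilcox-style wh-effect holds when a fronted filler lowers surprisal at a gap and raises it at a filled position. A sketch, assuming a surprisal function such as the one above:

```python
# Sketch of a Wilcox-style wh-effect check over a 2x2 (filler x gap) paradigm.
# surprisal_at_critical_region is assumed to return summed surprisal of the
# critical region for one stimulus (e.g., via the GPT-2 surprisal code above).

def wh_effects(cond, surprisal_at_critical_region):
    """cond maps ('+filler'/'-filler', '+gap'/'-gap') pairs to stimulus strings."""
    s = {k: surprisal_at_critical_region(v) for k, v in cond.items()}
    # A filler should make a gap *less* surprising ...
    gap_licensed = s[("+filler", "+gap")] < s[("-filler", "+gap")]
    # ... and make a filled argument position *more* surprising.
    filled_penalised = s[("+filler", "-gap")] > s[("-filler", "-gap")]
    return gap_licensed, filled_penalised
```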
Coordination of Theoretical and Computational Linguistics
Adam Przepiórkowski | Agnieszka Patejuk
The aim of this paper is to present a case study of a fruitful and, hopefully, inspiring interaction between formal and computational linguistics. A variety of NLP tools and resources have been used in linguistic investigations of the symmetry of coordination, leading to novel theoretical arguments. The converse attempt to bring theoretical results to bear on NLP work has been successful only in some cases.
An instructive implementation of semantic parsing and reasoning using Lexical Functional Grammar
Mark-Matthias Zymla | Kascha Kruschwitz | Paul Zodl
This paper presents a computational resource for exploring semantic parsing and reasoning through a strictly formal lens. Inspired by the framework of Lexical Functional Grammar, our system allows for modular exploration of different aspects of semantic parsing. It consists of a hand-coded formal grammar combining syntactic and semantic annotations, producing basic semantic representations. The system provides the option to extend these basic semantics via rewrite rules in a principled fashion to explore more complex reasoning. The result is a layered system enabling an incremental approach to semantic parsing. We illustrate this approach with examples from the FraCaS test suite, demonstrating its overall functionality and viability.
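The layered design can be illustrated with a toy rewrite step that unfolds a basic semantic term into a more explicit one before reasoning; the nested-tuple term language and the single rule below are invented for this sketch, not the system's actual formalism.

```python
# Toy illustration of extending a basic semantic representation via rewrite rules.

def rewrite(term, rules):
    """Apply the first matching rule at the top of a term, then recurse into it."""
    for head, rule in rules:
        if isinstance(term, tuple) and term and term[0] == head:
            term = rule(term)
    if isinstance(term, tuple):
        return tuple(rewrite(t, rules) for t in term)
    return term

# Unfold "most(A, B)" into an explicit cardinality comparison |A ∩ B| > |A \ B|,
# a form closer to what a reasoner can work with.
rules = [("most", lambda t: ("greater",
                             ("card", ("intersect", t[1], t[2])),
                             ("card", ("minus", t[1], t[2]))))]

basic = ("most", "european", "right_handed")
print(rewrite(basic, rules))
```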
Modelling Expectation-based and Memory-based Predictors of Human Reading Times with Syntax-guided Attention
Lukas Mielczarek | Timothée Bernard | Laura Kallmeyer | Katharina Spalek | Benoit Crabbé
The correlation between reading times and surprisal is well known in psycholinguistics and is easy to observe. There is also a correlation between reading times and structural integration, which is, however, harder to detect (Gibson, 2000). This correlation has been studied using parsing models whose outputs are linked to reading times. In this paper, we study the relevance of memory-based effects in reading times and how to predict them using neural language models. We find that integration costs significantly improve surprisal-based reading time prediction. Inspired by Timkey and Linzen (2023), we design a small-scale autoregressive transformer language model in which attention heads are supervised by dependency relations. We compare this model to a standard variant by checking how well each model’s outputs correlate with human reading times and find that predicted attention scores can be effectively used as proxies for syntactic integration costs to predict self-paced reading times.
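The nested-model comparison behind the claim that integration costs improve surprisal-based prediction can be sketched as follows; the column names and baseline predictors are illustrative assumptions, not the paper's exact specification.

```python
# Sketch of a nested regression comparison: does an integration-cost predictor
# improve a surprisal-based reading time model? Columns are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("reading_times.csv")  # hypothetical file: one row per word/region

base = smf.ols("rt ~ surprisal + word_length + log_freq", data=df).fit()
full = smf.ols("rt ~ surprisal + word_length + log_freq + integration_cost",
               data=df).fit()

# Likelihood-ratio test between the nested models.
print(base.llf, full.llf)            # log-likelihoods
print(full.compare_lr_test(base))    # (LR statistic, p-value, df difference)
```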
Syntax-Guided Parameter Efficient Fine-Tuning of Large Language Models
Prasanth
Large language models (LLMs) demonstrate remarkable linguistic capabilities but lack explicit syntactic knowledge grounded in formal grammatical theory. This paper introduces a syntax-guided parameter-efficient fine-tuning approach that integrates formal syntactic constraints into transformer-based models using Low-Rank Adaptation (LoRA). We develop a hybrid training objective incorporating violations of syntactic well-formedness derived from dependency parsing and context-free grammar constraints. Our method is evaluated on established English syntactic benchmarks, including BLiMP, CoLA, and SyntaxGym, targeting specific grammatical phenomena. Results show modest but consistent improvements in syntactic competence: a 1.6 percentage point average improvement on BLiMP overall, with gains of 1.7 percentage points on agreement phenomena and 1.6 percentage points on filler-gap dependencies, alongside a 0.006 improvement in CoLA MCC scores, while maintaining stable performance on general natural language processing (NLP) tasks. The parameter-efficient approach reduces training time by 76% compared to full fine-tuning while achieving these incremental syntactic gains. This work demonstrates a practical pathway for incorporating linguistic theory into modern NLP systems, though the improvements suggest that explicit syntactic supervision provides limited additional benefits over implicit learning from large-scale text.
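A minimal sketch of a LoRA setup with a hybrid objective of the kind described follows; the LoRA calls use the standard peft/transformers API, while the base model choice (gpt2), syntax_violation_score, and the weighting scheme are hypothetical placeholders for the paper's dependency/CFG-based terms.

```python
# Hedged sketch: LoRA adapters plus an LM loss modulated by a syntactic-violation
# score. The weighting scheme is one plausible reading of a "hybrid objective",
# not the paper's exact formulation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in base model
model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["c_attn"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)  # only the low-rank adapters are trainable

def hybrid_loss(batch_texts, syntax_violation_score, alpha=0.1):
    """Per-example LM loss, re-weighted by a (hypothetical) violation score."""
    losses = []
    for text in batch_texts:
        enc = tokenizer(text, return_tensors="pt")
        lm_loss = model(**enc, labels=enc["input_ids"]).loss
        losses.append((1.0 + alpha * syntax_violation_score(text)) * lm_loss)
    return torch.stack(losses).mean()
```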
On the relative impact of categorical and semantic information on the induction of self-embedding structures
Antoine Venant | Yutaka Suzuki
We investigate the impact of center embedding and selectional restrictions on neural latent tree models’ tendency to induce self-embedding structures. To this end, we compare their behavior in different controlled artificial environments involving noun phrases modified by relative clauses, with different quantities of available training data. Our results provide evidence that the presence of multiple center self-embeddings is a stronger incentive than selectional restrictions alone, but that the combination of both is the best incentive overall. We also show that different architectures benefit very differently from these incentives.
Plural Interpretive Biases: A Comparison Between Human Language Processing and Language Models
Jia Ren
Human communication routinely relies on plural predication, and plural sentences are often ambiguous (see, e.g., Scha, 1984; Dalrymple et al., 1998a). Building on extensive theoretical and experimental work in linguistics and philosophy, we ask whether large language models (LLMs) exhibit the same interpretive biases that humans show when resolving plural ambiguity. We focus on two lexical factors: (i) the collective bias of certain predicates (e.g., size/shape adjectives) and (ii) the symmetry bias of predicates. To probe these tendencies, we apply two complementary methods to premise–hypothesis pairs: an embedding-based heuristic using OpenAI’s text-embedding-3-large/small (OpenAI, 2024, 2025) with cosine similarity, and supervised NLI models (bart-large-mnli, roberta-large-mnli) (Lewis et al., 2020; Liu et al., 2019; Williams et al., 2018a; Facebook AI, 2024b,a) that yield asymmetric, calibrated entailment probabilities. Results show partial sensitivity to predicate-level distinctions, but neither method reproduces the robust human pattern, whereby neutral predicates favor entailment and strongly non-symmetric predicates disfavor it. These findings highlight both the potential and the limits of current LLMs: as cognitive models, they fall short of capturing human-like interpretive biases; as engineering systems, their representations of plural semantics remain unstable for tasks requiring precise entailment.
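The two probes can be illustrated on a single premise-hypothesis pair; the sentences below are invented examples, and the calls use the public OpenAI and Hugging Face interfaces for the models named in the abstract.

```python
# Sketch of the two probes on one premise-hypothesis pair (illustrative sentences).
import numpy as np
from openai import OpenAI
from transformers import pipeline

premise = "The boxes are heavy."
hypothesis = "Each box is heavy."

# (i) Embedding heuristic: cosine similarity between premise and hypothesis.
client = OpenAI()  # requires OPENAI_API_KEY in the environment
emb = client.embeddings.create(model="text-embedding-3-large",
                               input=[premise, hypothesis]).data
p, h = np.array(emb[0].embedding), np.array(emb[1].embedding)
cosine = float(p @ h / (np.linalg.norm(p) * np.linalg.norm(h)))

# (ii) Supervised NLI: asymmetric entailment/neutral/contradiction probabilities.
nli = pipeline("text-classification", model="facebook/bart-large-mnli")
scores = nli({"text": premise, "text_pair": hypothesis}, top_k=None)

print(cosine, scores)
```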