Iñigo Lopez-Gazpio

Also published as: Inigo Lopez-Gazpio


2026

Natural Language Inference (NLI) is a long-standing probe of models’ reasoning capabilities, yet it remains unclear how state-of-the-art systems represent and combine logical clauses in a way that supports robust generalization. We study directional effects in deductive NLI and introduce causal coherence, an evaluation paradigm that tests whether predictions remain consistent when the directionality of inference is reversed. Using fine-grained minimal-pair phrase data from PhrasIS, we evaluate encoder, decoder, and encoder–decoder transformers and analyze their behavior under both standard and manipulated settings. Our results show that models frequently fail to maintain logical stability when directionality varies, indicating shallow pattern matching rather than genuine clause composition. We formalize soft and hard causal coherence to disentangle directional consistency from correctness, and we provide an error analysis that highlights systematic failures involving semantic relations. Our findings suggest that deductive causal reasoning and coherence remain missing components in current transformer architectures, and that addressing them is necessary for reliable NLI.
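To make the evaluation idea concrete, here is a minimal sketch, not the paper's exact formalization of soft and hard causal coherence: given phrase pairs labeled in both directions, one can score how often a model's forward and reversed predictions are mutually consistent under standard NLI label algebra (soft), and how often both are also correct (hard). The `predict` callable and all names below are hypothetical stand-ins for any model interface.

```python
from typing import Callable, List, Tuple

Label = str  # "entailment" | "neutral" | "contradiction"

def consistent(fwd: Label, rev: Label) -> bool:
    """Textbook NLI label algebra for a reversed pair: contradiction is
    symmetric, while entailment or neutral in one direction permits either
    entailment or neutral in the other."""
    if fwd == "contradiction" or rev == "contradiction":
        return fwd == rev
    return True  # {entailment, neutral} x {entailment, neutral} are compatible

def causal_coherence(
    pairs: List[Tuple[str, str, Label, Label]],  # (premise, hypothesis, gold_fwd, gold_rev)
    predict: Callable[[str, str], Label],        # hypothetical model interface
) -> Tuple[float, float]:
    """Return (soft, hard) coherence rates over directionally labeled pairs."""
    soft_hits = hard_hits = 0
    for premise, hypothesis, gold_fwd, gold_rev in pairs:
        fwd = predict(premise, hypothesis)
        rev = predict(hypothesis, premise)
        soft_hits += consistent(fwd, rev)                   # direction-stable?
        hard_hits += (fwd == gold_fwd and rev == gold_rev)  # ...and correct?
    n = len(pairs)
    return soft_hits / n, hard_hits / n
```

Under this reading, a model can be soft-coherent while wrong (a stable but shallow heuristic) or accurate in one direction while incoherent overall, which is the distinction the abstract draws between directional consistency and correctness.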

2018

Following the recent success of word embeddings, it has been argued that there is no such thing as an ideal representation for words, as different models tend to capture divergent and often mutually incompatible aspects like semantics/syntax and similarity/relatedness. In this paper, we show that each embedding model captures more information than is directly apparent. A linear transformation that adjusts the similarity order of the model, without requiring any external resource, can tailor it to achieve better results in those aspects, providing a new perspective on how embeddings encode divergent linguistic information. In addition, we explore the relation between intrinsic and extrinsic evaluation, as the effect of our transformations on downstream tasks is larger for unsupervised systems than for supervised ones.
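A minimal sketch of one natural reading of such a transformation, under the assumption that "similarity order" refers to powers of the embedding similarity matrix: the n-th order similarity (X X^T)^n equals the first-order similarity of linearly transformed embeddings X (X^T X)^((n-1)/2), and fractional orders follow from the eigendecomposition of the Gram matrix. The function name and the `alpha` parameter below are illustrative, not the paper's notation.

```python
import numpy as np

def similarity_order_transform(X: np.ndarray, alpha: float) -> np.ndarray:
    """Linearly map embeddings X (one row per word) so that dot products of
    the result reproduce the alpha-th power of the original similarity
    matrix X @ X.T. Relies on (X X^T)^a = X (X^T X)^(a-1) X^T, extended to
    fractional a via the eigendecomposition of the symmetric PSD Gram
    matrix. alpha = 1 is the identity (up to numerical error)."""
    eigval, eigvec = np.linalg.eigh(X.T @ X)   # Gram matrix, d x d
    power = (alpha - 1.0) / 2.0
    scaled = np.zeros_like(eigval)
    positive = eigval > 1e-12                  # guard zero/negative numerical noise
    scaled[positive] = eigval[positive] ** power
    W = eigvec @ np.diag(scaled) @ eigvec.T    # the linear transformation
    return X @ W

# Sanity check: alpha = 2 squares the similarity matrix.
X = np.random.randn(50, 10)
X2 = similarity_order_transform(X, 2.0)
assert np.allclose(X2 @ X2.T, (X @ X.T) @ (X @ X.T), atol=1e-6)
```

Sweeping `alpha` below and above 1 then shifts the same embeddings between differently flavored similarity spaces without any external resource, in the spirit of the claim that each model captures more information than is directly apparent.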

2017

Semantic Textual Similarity (STS) measures the meaning similarity of sentences. Applications include machine translation (MT), summarization, generation, question answering (QA), short answer grading, semantic search, dialog and conversational systems. The STS shared task is a venue for assessing the current state of the art. The 2017 task focuses on multilingual and cross-lingual pairs, with one sub-track exploring MT quality estimation (MTQE) data. The task attracted strong participation from 31 teams, 17 of which participated in all language tracks. We summarize performance and review a selection of well-performing methods. Analysis highlights common errors, providing insight into the limitations of existing models. To support ongoing work on semantic representations, the STS Benchmark is introduced as a new shared training and evaluation set carefully selected from the corpus of English STS shared task data (2012–2017).
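For readers new to the task format: gold annotations grade each sentence pair on a 0 (unrelated) to 5 (semantically equivalent) scale, and systems are ranked by Pearson correlation with those scores. A deliberately minimal baseline, illustrative only and not a participating system (`vectors` is a hypothetical token-to-embedding lookup), could score pairs like this:

```python
import numpy as np

def sts_score(sent1: str, sent2: str, vectors: dict) -> float:
    """Toy STS baseline: cosine similarity of averaged word vectors,
    rescaled to the task's 0..5 similarity range."""
    def embed(sentence: str) -> np.ndarray:
        tokens = [t for t in sentence.lower().split() if t in vectors]
        if not tokens:  # fully out-of-vocabulary sentence: zero vector
            return np.zeros_like(next(iter(vectors.values())))
        return np.mean([vectors[t] for t in tokens], axis=0)

    v1, v2 = embed(sent1), embed(sent2)
    denom = float(np.linalg.norm(v1) * np.linalg.norm(v2))
    cosine = float(v1 @ v2) / denom if denom else 0.0
    return 5.0 * max(cosine, 0.0)  # clamp negative cosines to 0
```

Shared-task systems are considerably richer than this sketch; it only fixes intuitions about the input/output contract against which the reviewed methods are scored.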
