Víctor Fresno

Also published as: Victor Fresno

2022

pdf abs
Is anisotropy really the cause of BERT embeddings not being semantic?
Alejandro Fuster Baggetto | Victor Fresno
Findings of the Association for Computational Linguistics: EMNLP 2022

In this paper we conduct a set of experiments aimed to improve our understanding of the lack of semantic isometry in BERT, i.e. the lack of correspondence between the embedding and meaning spaces of its contextualized word representations. Our empirical results show that, contrary to popular belief, the anisotropy is not the root cause of the poor performance of these contextual models’ embeddings in semantic tasks. What does affect both the anisotropy and semantic isometry is a set of known biases: frequency, subword, punctuation, and case. For each one of them, we measure its magnitude and the effect of its removal, showing that these biases contribute but do not completely explain the phenomenon of anisotropy and lack of semantic isometry of these contextual language models.

pdf abs
Information Theory–based Compositional Distributional Semantics
Enrique Amigó | Alejandro Ariza-Casabona | Victor Fresno | M. Antònia Martí
Computational Linguistics, Volume 48, Issue 4 - December 2022

In the context of text representation, Compositional Distributional Semantics models aim to fuse the Distributional Hypothesis and the Principle of Compositionality. Text embedding is based on co-ocurrence distributions and the representations are in turn combined by compositional functions taking into account the text structure. However, the theoretical basis of compositional functions is still an open issue. In this article we define and study the notion of Information Theory–based Compositional Distributional Semantics (ICDS): (i) We first establish formal properties for embedding, composition, and similarity functions based on Shannon’s Information Theory; (ii) we analyze the existing approaches under this prism, checking whether or not they comply with the established desirable properties; (iii) we propose two parameterizable composition and similarity functions that generalize traditional approaches while fulfilling the formal properties; and finally (iv) we perform an empirical study on several textual similarity datasets that include sentences with a high and low lexical overlap, and on the similarity between words and their description. Our theoretical analysis and empirical results show that fulfilling formal properties affects positively the accuracy of text representation models in terms of correspondence (isometry) between the embedding and meaning spaces.

2014

In this paper we introduce TweetNorm_es, an annotated corpus of tweets in Spanish language, which we make publicly available under the terms of the CC-BY license. This corpus is intended for development and testing of microtext normalization systems. It was created for Tweet-Norm, a tweet normalization workshop and shared task, and is the result of a joint annotation effort from different research groups. In this paper we describe the methodology defined to build the corpus as well as the guidelines followed in the annotation process. We also present a brief overview of the Tweet-Norm shared task, as the first evaluation environment where the corpus was used.

pdf
A Data Driven Approach for Person Name Disambiguation in Web Search Results
Agustín D. Delgado | Raquel Martínez | Víctor Fresno | Soto Montalvo
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Víctor Fresno

2022

2014

2009

2006

Co-authors

Venues