Julio Gonzalo
2025
Evaluating Sequence Labeling on the basis of Information Theory
Enrique Amigó | Elena Álvarez-Mellado | Julio Gonzalo | Jorge Carrillo-de-Albornoz
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Various metrics exist for evaluating sequence labeling problems (strict span matching, token-oriented metrics, token concurrence in sequences, etc.), each focusing on certain aspects of the task. In this paper, we define a comprehensive set of formal properties that captures the strengths and weaknesses of the existing metric families and prove that none of them satisfies all properties simultaneously. We argue that it is necessary to measure how much information (correct or noisy) each token in the sequence contributes, depending on aspects such as sequence length, number of tokens annotated by the system, token specificity, etc. On this basis, we introduce the Sequence Labelling Information Contrast Model (SL-ICM), a novel metric based on information theory for evaluating sequence labeling tasks. Our formal analysis and experimentation show that the proposed metric satisfies all properties simultaneously.
Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination
Eva Sánchez Salido | Roser Morante | Julio Gonzalo | Guillermo Marco | Jorge Carrillo-de-Albornoz | Laura Plaza | Enrique Amigó | Andrés Fernandez García | Alejandro Benito-Santos | Adrián Ghajari Espinosa | Víctor Fresno
Proceedings of the 31st International Conference on Computational Linguistics
In this article we present UNED-ACCESS 2024, a bilingual dataset consisting of 1003 multiple-choice questions from university entrance level exams in Spanish and English. Questions are originally formulated in Spanish and manually translated into English, and have never been publicly released, ensuring minimal contamination when evaluating Large Language Models on this dataset. A selection of current open-source and proprietary models is evaluated in a uniform zero-shot experimental setting both on the UNED-ACCESS 2024 dataset and on an equivalent subset of MMLU questions. Results show that (i) smaller models not only perform worse than the largest models, but also degrade faster in Spanish than in English; the performance gap between the two languages is negligible for the best models but grows up to 37% for smaller models; (ii) model ranking on UNED-ACCESS 2024 is almost identical (0.98 Pearson correlation) to the one obtained with MMLU (a similar, but publicly available, benchmark), suggesting that contamination affects all models similarly; and (iii) as in publicly available datasets, reasoning questions in UNED-ACCESS are more challenging for models of all sizes.
Small Language Models can Outperform Humans in Short Creative Writing: A Study Comparing SLMs with Humans and LLMs
Guillermo Marco | Luz Rello | Julio Gonzalo
Proceedings of the 31st International Conference on Computational Linguistics
In this paper, we evaluate the creative fiction writing abilities of a fine-tuned small language model (SLM), BART-large, and compare its performance to human writers and two large language models (LLMs): GPT-3.5 and GPT-4o. Our evaluation consists of two experiments: (i) a human study in which 68 participants rated short stories from humans and the SLM on grammaticality, relevance, creativity, and attractiveness, and (ii) a qualitative linguistic analysis examining the textual characteristics of stories produced by each model. In the first experiment, BART-large outscored average human writers overall (2.11 vs. 1.85), a 14% relative improvement, though the slight human advantage in creativity was not statistically significant. In the second experiment, qualitative analysis showed that while GPT-4o demonstrated near-perfect coherence and used fewer clichéd phrases, it tended to produce more predictable language, with only 3% of its synopses featuring surprising associations (compared to 15% for BART). These findings highlight how model size and fine-tuning influence the balance between creativity, fluency, and coherence in creative writing tasks, and demonstrate that smaller models can, in certain contexts, rival both humans and larger models.
The Reader is the Metric: How Textual Features and Reader Profiles Explain Conflicting Evaluations of AI Creative Writing
Guillermo Marco | Julio Gonzalo | Víctor Fresno
Findings of the Association for Computational Linguistics: ACL 2025
Recent studies comparing AI-generated and human-authored literary texts have produced conflicting results: some suggest AI already surpasses human quality, while others argue it still falls short. We start from the hypothesis that such divergences can be largely explained by genuine differences in how readers interpret and value literature, rather than by an intrinsic quality of the texts evaluated. Using five public datasets (1,471 stories, 101 annotators including critics, students, and lay readers), we (i) extract 17 reference-less textual features (e.g., coherence, emotional variance, average sentence length...); (ii) model individual reader preferences, deriving feature importance vectors that reflect their textual priorities; and (iii) analyze these vectors in a shared “preference space”. Reader vectors cluster into two profiles: _surface-focused readers_ (mainly non-experts), who prioritize readability and textual richness; and _holistic readers_ (mainly experts), who value thematic development, rhetorical variety, and sentiment dynamics. Our results quantitatively explain how measurements of literary quality are a function of how text features align with each reader’s preferences. These findings advocate for reader-sensitive evaluation frameworks in the field of creative text generation.
2024
Pron vs Prompt: Can Large Language Models already Challenge a World-Class Fiction Author at Creative Text Writing?
Guillermo Marco | Julio Gonzalo | M.Teresa Mateo-Girona | Ramón Del Castillo Santos
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Are LLMs ready to compete in creative writing skills with a top (rather than average) novelist? To provide an initial answer to this question, we have carried out a contest between Patricio Pron (an award-winning novelist, considered one of the best of his generation) and GPT-4 (one of the top-performing LLMs), in the spirit of AI-human duels such as DeepBlue vs Kasparov and AlphaGo vs Lee Sedol. We asked Pron and GPT-4 to provide thirty titles each, and then to write short stories for both their own titles and their opponent's. Then, we prepared an evaluation rubric inspired by Boden's definition of creativity, and we collected several detailed expert assessments of the texts, provided by literature critics and scholars. The results of our experimentation indicate that LLMs are still far from challenging a top human creative writer. We also observed that GPT-4 writes more creatively using Pron's titles than its own (an indication of the potential for human-machine co-creation). Additionally, we found that GPT-4 has a more creative writing style in English than in Spanish.
A Web Portal about the State of the Art of NLP Tasks in Spanish
Enrique Amigó | Jorge Carrillo-de-Albornoz | Andrés Fernández | Julio Gonzalo | Guillermo Marco | Roser Morante | Laura Plaza | Jacobo Pedrosa
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
This paper presents a new web portal with information about the state of the art of natural language processing tasks in Spanish. It provides information about forums, competitions, tasks and datasets in Spanish that would otherwise be spread across multiple articles and websites. The portal consists of overview pages, where information can be searched for and filtered by several criteria, and individual pages with detailed information and hyperlinks to facilitate navigation. Information has been manually curated from publications that describe competitions and NLP tasks from 2013 until 2023, and will be updated as new tasks appear. A total of 185 tasks and 128 datasets from 94 competitions have been introduced.
2020
An Effectiveness Metric for Ordinal Classification: Formal Properties and Experimental Results
Enrique Amigó | Julio Gonzalo | Stefano Mizzaro | Jorge Carrillo-de-Albornoz
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
In Ordinal Classification tasks, items have to be assigned to classes that have a relative ordering, such as “positive”, “neutral”, “negative” in sentiment analysis. Remarkably, the most popular evaluation metrics for ordinal classification tasks either ignore relevant information (for instance, precision/recall on each of the classes ignores their relative ordering) or assume additional information (for instance, Mean Average Error assumes absolute distances between classes). In this paper we propose a new metric for Ordinal Classification, the Closeness Evaluation Measure, rooted in Measurement Theory and Information Theory. Our theoretical analysis and experimental results over both synthetic data and data from NLP shared tasks indicate that the proposed metric captures quality aspects from different traditional tasks simultaneously. In addition, it generalizes some popular classification (nominal scale) and error minimization (interval scale) metrics, depending on the measurement scale in which it is instantiated.
2012
Automatic Extraction of Polar Adjectives for the Creation of Polarity Lexicons
Silvia Vázquez | Muntsa Padró | Núria Bel | Julio Gonzalo
Proceedings of COLING 2012: Posters
UNED: Improving Text Similarity Measures without Human Assessments
Enrique Amigó | Jesús Giménez | Julio Gonzalo | Felisa Verdejo
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)
The Heterogeneity Principle in Evaluation Measures for Automatic Summarization
Enrique Amigó | Julio Gonzalo | Felisa Verdejo
Proceedings of Workshop on Evaluation Metrics and System Comparison for Automatic Summarization
2011
Corroborating Text Evaluation Results with Heterogeneous Measures
Enrique Amigó | Julio Gonzalo | Jesús Giménez | Felisa Verdejo
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing
2010
Wikipedia as Sense Inventory to Improve Diversity in Web Search Results
Celina Santamaría | Julio Gonzalo | Javier Artiles
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
2009
The role of named entities in Web People Search
Javier Artiles | Enrique Amigó | Julio Gonzalo
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing
The Contribution of Linguistic Features to Automatic Machine Translation Evaluation
Enrique Amigó | Jesús Giménez | Julio Gonzalo | Felisa Verdejo
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP
The Impact of Query Refinement in the Web People Search Task
Javier Artiles | Julio Gonzalo | Enrique Amigó
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers
2008
From Research to Application in Multilingual Information Access: the Contribution of Evaluation
Carol Peters | Martin Braschler | Giorgio Di Nunzio | Nicola Ferro | Julio Gonzalo | Mark Sanderson
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
The importance of evaluation in promoting research and development in the information retrieval and natural language processing domains has long been recognised, but is this sufficient? In many areas there is still a considerable gap between the results achieved by the research community and their implementation in commercial applications. This is particularly true for the cross-language and multilingual retrieval areas. Despite the strong demand for and interest in multilingual IR functionality, there are still very few operational systems on offer. The Cross Language Evaluation Forum (CLEF) is now taking steps aimed at changing this situation. The paper provides a critical assessment of the main results achieved by CLEF so far and discusses plans now underway to extend its activities in order to have a more direct impact on the application sector.
2007
The SemEval-2007 WePS Evaluation: Establishing a benchmark for the Web People Search Task
Javier Artiles | Julio Gonzalo | Satoshi Sekine
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)
2006
MT Evaluation: Human-Like vs. Human Acceptable
Enrique Amigó | Jesús Giménez | Julio Gonzalo | Lluís Màrquez
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions
2005
QARLA: A Framework for the Evaluation of Text Summarization Systems
Enrique Amigó | Julio Gonzalo | Anselmo Peñas | Felisa Verdejo
Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05)
Evaluating DUC 2004 Tasks with the QARLA Framework
Enrique Amigó | Julio Gonzalo | Anselmo Peñas | Felisa Verdejo
Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization
2004
Using syntactic information to extract relevant terms for multi-document summarization
Enrique Amigó | Julio Gonzalo | Víctor Peinado | Anselmo Peñas | Felisa Verdejo
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics
The Future of Evaluation for Cross-Language Information Retrieval Systems
Carol Peters | Martin Braschler | Khalid Choukri | Julio Gonzalo | Michael Kluck
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
The objective of the Cross-Language Evaluation Forum (CLEF) is to promote research in the multilingual information access domain. In this short paper, we list the achievements of CLEF during its first four years of activity and describe how the range of tasks has been considerably expanded during this period. The aim of the paper is to demonstrate the importance of evaluation initiatives with respect to system research and development and to show how essential it is for such initiatives to keep abreast of and even anticipate the emerging needs of both system developers and application communities if they are to have a future.
An Empirical Study of Information Synthesis Task
Enrique Amigó | Julio Gonzalo | Víctor Peinado | Anselmo Peñas | Felisa Verdejo
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)
2003
Automatic Association of Web Directories with Word Senses
Celina Santamaría | Julio Gonzalo | Felisa Verdejo
Computational Linguistics, Volume 29, Number 3, September 2003: Special Issue on the Web as Corpus
2002
A Study of Polysemy and Sense Proximity in the Senseval-2 Test Suite
Irina Chugur | Julio Gonzalo | Felisa Verdejo
Proceedings of the ACL-02 Workshop on Word Sense Disambiguation: Recent Successes and Future Directions
2001
Framework and Results for the Spanish SENSEVAL
German Rigau | Mariona Taulé | Ana Fernandez | Julio Gonzalo
Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems
The UNED Systems at SENSEVAL-2
David Fernández-Amorós | Julio Gonzalo | Felisa Verdejo
Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems
2000
Evaluating Wordnets in Cross-language Information Retrieval: the ITEM Search Engine
Felisa Verdejo | Julio Gonzalo | Anselmo Peñas | Fernando López | David Fernández
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)
Sense clusters for Information Retrieval: Evidence from Semcor and the EuroWordNet InterLingual Index
Julio Gonzalo | Irina Chugur | Felisa Verdejo
ACL-2000 Workshop on Word Senses and Multi-linguality
1999
Towards a Universal Index of Meaning
Piek Vossen | Wim Peters | Julio Gonzalo
SIGLEX99: Standardizing Lexical Resources
Lexical ambiguity and Information Retrieval revisited
Julio Gonzalo | Anselmo Peñas | Felisa Verdejo
1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora
An Open Distance Learning Web-Course for NLP in IR
Felisa Verdejo | Julio Gonzalo | Anselmo Peñas
EACL 1999: Computer and Internet Supported Education in Language and Speech Technology
1998
Indexing with WordNet synsets can improve text retrieval
Julio Gonzalo | Felisa Verdejo | Irina Chugur | Juan Cigarran
Usage of WordNet in Natural Language Processing Systems
1995
Generic Rules and Non-Constituent Coordination
Julio Gonzalo | Teresa Solías
Proceedings of the Fourth International Workshop on Parsing Technologies
We present a metagrammatical formalism, generic rules, to give a default interpretation to grammar rules. Our formalism introduces a process of dynamic binding that interfaces the level of pure grammatical knowledge representation with the parsing level. We present an approach to non-constituent coordination within categorial grammars and reformulate it as a generic rule. This reformulation is context-free parsable and drastically reduces the search space associated with the parsing task for such phenomena.
Co-authors
- Felisa Verdejo 16
- Enrique Amigó 15
- Anselmo Peñas 7
- Guillermo Marco 5
- Javier Artiles 4
- Jesús Giménez 4
- Jorge Carrillo-de-Albornoz 3
- Irina Chugur 3
- Martin Braschler 2
- David Fernández-Amorós 2
- Víctor Fresno 2
- Roser Morante 2
- Víctor Peinado 2
- Carol Peters 2
- Laura Plaza 2
- Celina Santamaría 2
- Núria Bel 1
- Alejandro Benito-Santos 1
- Khalid Choukri 1
- Juan Cigarran 1
- Giorgio Maria Di Nunzio 1
- Ana Fernandez 1
- Andrés Fernández 1
- Nicola Ferro 1
- Andrés Fernandez García 1
- Adrián Ghajari Espinosa 1
- Michael Kluck 1
- Fernando López 1
- M.Teresa Mateo-Girona 1
- Stefano Mizzaro 1
- Lluís Màrquez 1
- Muntsa Padró 1
- Jacobo Pedrosa 1
- Wim Peters 1
- Luz Rello 1
- German Rigau 1
- Mark Sanderson 1
- Ramón Del Castillo Santos 1
- Satoshi Sekine 1
- Teresa Solías 1
- Eva Sánchez Salido 1
- Mariona Taulé 1
- Piek Vossen 1
- Silvia Vázquez 1
- Elena Álvarez-Mellado 1