Jose Maria Alonso-Moral
Also published as: Jose M. Alonso, Jose Alonso, Jose M. Alonso-Moral, J.M. Alonso-Moral
2025
Enhancing Training Data Quality through Influence Scores for Generalizable Classification: A Case Study on Sexism Detection
Rabiraj Bandyopadhyay | Dennis Assenmacher | Jose M. Alonso-Moral | Claudia Wagner
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
The quality of training data is crucial for the performance of supervised machine learning models. In particular, poor annotation quality and spurious correlations between labels and features in text datasets can significantly degrade model generalization. This problem is especially pronounced in harmful language detection, where prior studies have revealed major deficiencies in existing datasets. In this work, we design and test data selection methods based on learnability measures to improve dataset quality. Using a sexism dataset with counterfactuals designed to avoid spurious correlations, we show that pruning with EL2N and PVI scores can lead to significant performance increases and outperform submodular and random selection. Our analysis reveals that, in the presence of label imbalance, models rely on dataset shortcuts; in particular, easy-to-classify sexist instances and hard-to-classify non-sexist instances contain shortcuts. Pruning these instances leads to performance increases. Pruning hard-to-classify instances is, in general, also a promising strategy when shortcuts are not present.
2024
ReproHum #0927-3: Reproducing The Human Evaluation Of The DExperts Controlled Text Generation Method
Javier González Corbelle | A. Vivel-Couso | J.M. Alonso-Moral | A. Bugarín-Diz
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024
This paper presents a reproduction study aimed at reproducing and validating a human NLP evaluation performed for the DExperts text generation method. The original study introduces DExperts, a controlled text generation method, evaluated using non-toxic prompts from the RealToxicityPrompts dataset. Our reproduction study aims to reproduce the human evaluation of the continuations generated by DExperts in comparison with four baseline methods, in terms of toxicity, topicality, and fluency. We first describe the agreed approach for reproduction within the ReproHum project and detail the configuration of the original evaluation, including necessary adaptations for reproduction. Then, we compare our reproduction results with those reported in the reproduced paper. Interestingly, the human evaluators in our experiment perceived higher quality in the texts generated by DExperts, in terms of lower toxicity and better fluency. Overall, the new scores are higher, including for the baseline methods. This study contributes to ongoing efforts to ensure the reproducibility and reliability of findings in NLP evaluation and emphasizes the critical role of robust methodologies in advancing the field.
2023
Some lessons learned reproducing human evaluation of a data-to-text system
Javier González-Corbelle | Jose M. Alonso-Moral | A. Bugarín-Diz
Proceedings of the 3rd Workshop on Human Evaluation of NLP Systems
This paper presents a human evaluation reproduction study regarding the data-to-text generation task. The evaluation focuses on counting the supported and contradicting facts generated by a neural data-to-text model with a macro planning stage. The model is tested by generating sports summaries for the ROTOWIRE dataset. We first describe the approach to reproduction agreed in the context of the ReproHum project. Then, we detail the entire configuration of the original human evaluation and the adaptations that had to be made to reproduce it. Finally, we compare the reproduction results with those reported in the paper that was taken as reference.
Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP
Anya Belz | Craig Thomson | Ehud Reiter | Gavin Abercrombie | Jose M. Alonso-Moral | Mohammad Arvan | Anouck Braggaar | Mark Cieliebak | Elizabeth Clark | Kees van Deemter | Tanvi Dinkar | Ondřej Dušek | Steffen Eger | Qixiang Fang | Mingqi Gao | Albert Gatt | Dimitra Gkatzia | Javier González-Corbelle | Dirk Hovy | Manuela Hürlimann | Takumi Ito | John D. Kelleher | Filip Klubicka | Emiel Krahmer | Huiyuan Lai | Chris van der Lee | Yiru Li | Saad Mahamood | Margot Mieskes | Emiel van Miltenburg | Pablo Mosteiro | Malvina Nissim | Natalie Parde | Ondřej Plátek | Verena Rieser | Jie Ruan | Joel Tetreault | Antonio Toral | Xiaojun Wan | Leo Wanner | Lewis Watson | Diyi Yang
Proceedings of the Fourth Workshop on Insights from Negative Results in NLP
We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction were discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding, that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.
2022
Dealing with hallucination and omission in neural Natural Language Generation: A use case on meteorology.
Javier González-Corbelle | Jose M. Alonso-Moral | A. Bugarín-Diz | J. Taboada
Proceedings of the 15th International Conference on Natural Language Generation
2020
A proof of concept on triangular test evaluation for Natural Language Generation
Javier Gonzalez-Corbelle | Jose M. Alonso | A. Bugarín
Proceedings of the 1st Workshop on Evaluating NLG Evaluation
The evaluation of Natural Language Generation (NLG) systems has recently aroused much interest in the research community, since it should address several challenging aspects, such as readability of the generated texts, adequacy to the user within a particular context and moment, and linguistic quality-related issues (e.g., correctness, coherence, understandability), among others. In this paper, we propose a novel technique for evaluating NLG systems that is inspired by the triangular test used in the field of sensory analysis. This technique allows us to compare two texts generated by different subjects and to i) determine whether statistically significant differences are detected between them when evaluated by humans and ii) quantify to what extent the number of evaluators plays an important role in the sensitivity of the results. As a proof of concept, we apply this evaluation technique in a real use case in the field of meteorology, showing the advantages and disadvantages of our proposal.
2nd Workshop on Interactive Natural Language Technology for Explainable Artificial Intelligence
Jose M. Alonso | Alejandro Catala
2nd Workshop on Interactive Natural Language Technology for Explainable Artificial Intelligence
Towards Harnessing Natural Language Generation to Explain Black-box Models
Ettore Mariotti | Jose M. Alonso | Albert Gatt
2nd Workshop on Interactive Natural Language Technology for Explainable Artificial Intelligence
The opaque nature of many machine learning techniques prevents the wide adoption of powerful information processing tools for high-stakes scenarios. The emerging field of eXplainable Artificial Intelligence (XAI) aims at providing justifications for automatic decision-making systems in order to ensure reliability and trustworthiness for users. To achieve this vision, we emphasize the importance of a natural language textual modality as a key component for a future intelligent interactive agent. We outline the challenges of XAI and review a set of publications that work in this direction.
2019
Proceedings of the 1st Workshop on Interactive Natural Language Technology for Explainable Artificial Intelligence (NL4XAI 2019)
Jose M. Alonso | Alejandro Catala
Proceedings of the 1st Workshop on Interactive Natural Language Technology for Explainable Artificial Intelligence (NL4XAI 2019)
Paving the way towards counterfactual generation in argumentative conversational agents
Ilia Stepin | Alejandro Catala | Martin Pereira-Fariña | Jose M. Alonso
Proceedings of the 1st Workshop on Interactive Natural Language Technology for Explainable Artificial Intelligence (NL4XAI 2019)
2018
Supporting Content Design with an Eye Tracker: The Case of Weather-based Recommendations
Alejandro Catala | Jose M. Alonso | Alberto Bugarin
Proceedings of the Workshop on Intelligent Interactive Systems and Language Generation (2IS&NLG)
Meteorologists and Students: A resource for language grounding of geographical descriptors
Alejandro Ramos-Soto | Ehud Reiter | Kees van Deemter | Jose M. Alonso | Albert Gatt
Proceedings of the 11th International Conference on Natural Language Generation
We present a data resource which can be useful for research purposes on language grounding tasks in the context of geographical referring expression generation. The resource is composed of two data sets that encompass 25 different geographical descriptors and a set of associated graphical representations, drawn as polygons on a map by two groups of human subjects: teenage students and expert meteorologists.
Proceedings of the Workshop on Intelligent Interactive Systems and Language Generation (2IS&NLG)
Jose M. Alonso | Alejandro Catala | Mariët Theune
Proceedings of the Workshop on Intelligent Interactive Systems and Language Generation (2IS&NLG)
2017
Linguistic Description of Complex Phenomena with the rLDCP R Package
Jose M. Alonso | Patricia Conde-Clemente | Gracian Trivino
Proceedings of the 10th International Conference on Natural Language Generation
Monitoring and analysis of complex phenomena attract the attention of both academia and industry. Dealing with data produced by complex phenomena requires the use of advanced computational intelligence techniques. Namely, linguistic description of complex phenomena constitutes a mature research line. It is supported by the Computational Theory of Perceptions grounded on the Fuzzy Sets Theory. Its aim is the development of computational systems with the ability to generate vague descriptions of the world in a way similar to how humans do. This is a human-centric and multi-disciplinary research work. Moreover, its success is a matter of careful design; thus, developers play a key role. The rLDCP R package was designed to facilitate the development of new applications. This demo introduces the use of rLDCP, for both beginners and advanced developers, in practical use cases.
Co-authors
- Alberto Bugarín 6
- Alejandro Catala 5
- Javier González Corbelle 5
- Albert Gatt 3
- Ehud Reiter 3
- Gracian Trivino 2
- Kees van Deemter 2
- Gavin Abercrombie 1
- Mohammad Arvan 1
- Dennis Assenmacher 1
- Rabiraj Bandyopadhyay 1
- Anja Belz 1
- Anouck Braggaar 1
- Mark Cieliebak 1
- Elizabeth Clark 1
- Patricia Conde-Clemente 1
- Tanvi Dinkar 1
- Ondřej Dušek 1
- Steffen Eger 1
- Qixiang Fang 1
- Mingqi Gao 1
- Dimitra Gkatzia 1
- Dirk Hovy 1
- Manuela Huerlimann 1
- Takumi Ito 1
- John Kelleher 1
- Filip Klubicka 1
- Emiel Krahmer 1
- Huiyuan Lai 1
- Yiru Li 1
- Saad Mahamood 1
- Ettore Mariotti 1
- Margot Mieskes 1
- Pablo Mosteiro 1
- Malvina Nissim 1
- Natalie Parde 1
- Martín Pereira-Fariña 1
- Ondřej Plátek 1
- Alejandro Ramos-Soto 1
- Verena Rieser 1
- Jie Ruan 1
- Ilia Stepin 1
- J. Taboada 1
- Joel Tetreault 1
- Mariët Theune 1
- Craig Thomson 1
- Antonio Toral 1
- Emiel Van Miltenburg 1
- A. Vivel-Couso 1
- Claudia Wagner 1
- Xiaojun Wan 1
- Leo Wanner 1
- Lewis Watson 1
- Diyi Yang 1
- Chris van der Lee 1