Fernando Diaz


2025

Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy
Athiya Deviyani | Fernando Diaz
Findings of the Association for Computational Linguistics: NAACL 2025

Meta-evaluation of automatic evaluation metrics—assessing evaluation metrics themselves—is crucial for accurately benchmarking natural language processing systems and has implications for scientific inquiry, production model development, and policy enforcement. While existing approaches to metric meta-evaluation focus on general statements about the absolute and relative quality of metrics across arbitrary system outputs, in practice, metrics are applied in highly contextual settings, often measuring the performance for a highly constrained set of system outputs. For example, we may only be interested in evaluating a specific model or class of models. We introduce a method for contextual metric meta-evaluation by comparing the local metric accuracy of evaluation metrics. Across translation, speech recognition, and ranking tasks, we demonstrate that the local metric accuracies vary both in absolute value and relative effectiveness as we shift across evaluation contexts. This observed variation highlights the importance of adopting context-specific metric evaluations over global ones.

2024

Extrinsic Evaluation of Cultural Competence in Large Language Models
Shaily Bhatt | Fernando Diaz
Findings of the Association for Computational Linguistics: EMNLP 2024

Productive interactions between diverse users and language technologies require outputs from the latter to be culturally relevant and sensitive. Prior works have evaluated models’ knowledge of cultural norms, values, and artefacts, without considering how this knowledge manifests in downstream applications. In this work, we focus on extrinsic evaluation of cultural competence in two text generation tasks, open-ended question answering and story generation. We quantitatively and qualitatively evaluate model outputs when an explicit cue of culture, specifically nationality, is perturbed in the prompts. Although we find that model outputs do vary when varying nationalities and feature culturally relevant words, we also find weak correlations between text similarity of outputs for different countries and the cultural values of these countries. Finally, we discuss important considerations in designing comprehensive evaluation of cultural competence in user-facing tasks.

2016

Query Expansion with Locally-Trained Word Embeddings
Fernando Diaz | Bhaskar Mitra | Nick Craswell
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

Predicting Salient Updates for Disaster Summarization
Chris Kedzie | Kathleen McKeown | Fernando Diaz
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

2010

Cross-Market Model Adaptation with Pairwise Preference Data for Web Search Ranking
Jing Bai | Fernando Diaz | Yi Chang | Zhaohui Zheng | Keke Chen
Coling 2010: Posters