2025
pdf
bib
abs
NGQA: A Nutritional Graph Question Answering Benchmark for Personalized Health-aware Nutritional Reasoning
Zheyuan Zhang
|
Yiyang Li
|
Nhi Ha Lan Le
|
Zehong Wang
|
Tianyi Ma
|
Vincent Galassi
|
Keerthiram Murugesan
|
Nuno Moniz
|
Werner Geyer
|
Nitesh V Chawla
|
Chuxu Zhang
|
Yanfang Ye
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Diet plays a critical role in human health, yet tailoring dietary reasoning to individual health conditions remains a major challenge. Nutrition Question Answering (QA) has emerged as a popular method for addressing this problem. However, current research faces two critical limitations. On one hand, the absence of datasets involving user-specific medical information severely limits personalization. This challenge is further compounded by the wide variability in individual health needs. On the other hand, while large language models (LLMs), a popular solution for this task, demonstrate strong reasoning abilities, they struggle with the domain-specific complexities of personalized healthy dietary reasoning, and existing benchmarks fail to capture these challenges. To address these gaps, we introduce the Nutritional Graph Question Answering (NGQA) benchmark, the first graph question answering dataset designed for personalized nutritional health reasoning. NGQA leverages data from the National Health and Nutrition Examination Survey (NHANES) and the Food and Nutrient Database for Dietary Studies (FNDDS) to evaluate whether a food is healthy for a specific user, supported by explanations of the key contributing nutrients. The benchmark incorporates three question complexity settings and evaluates reasoning across three downstream tasks. Extensive experiments with LLM backbones and baseline models demonstrate that the NGQA benchmark effectively challenges existing models. In sum, NGQA addresses a critical real-world problem while advancing GraphQA research with a novel domain-specific benchmark. Our codebase and dataset are available here.
pdf
bib
abs
Multi-Level Explanations for Generative Language Models
Lucas Monteiro Paes
|
Dennis Wei
|
Hyo Jin Do
|
Hendrik Strobelt
|
Ronny Luss
|
Amit Dhurandhar
|
Manish Nagireddy
|
Karthikeyan Natesan Ramamurthy
|
Prasanna Sattigeri
|
Werner Geyer
|
Soumya Ghosh
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite the increasing use of large language models (LLMs) for context-grounded tasks like summarization and question-answering, understanding what makes an LLM produce a certain response is challenging. We propose Multi-Level Explanations for Generative Language Models (MExGen), a technique to provide explanations for context-grounded text generation. MExGen assigns scores to parts of the context to quantify their influence on the model’s output. It extends attribution methods like LIME and SHAP to LLMs used in context-grounded tasks where (1) inference cost is high, (2) input text is long, and (3) the output is text. We conduct a systematic evaluation, both automated and human, of perturbation-based attribution methods for summarization and question answering. The results show that our framework can provide more faithful explanations of generated output than available alternatives, including LLM self-explanations. We open-source code for MExGen as part of the ICX360 toolkit: https://github.com/IBM/ICX360.
pdf
bib
abs
Granite Guardian: Comprehensive LLM Safeguarding
Inkit Padhi
|
Manish Nagireddy
|
Giandomenico Cornacchia
|
Subhajit Chaudhury
|
Tejaswini Pedapati
|
Pierre Dognin
|
Keerthiram Murugesan
|
Erik Miehling
|
Martín Santillán Cooper
|
Kieran Fraser
|
Giulio Zizzo
|
Muhammad Zaid Hameed
|
Mark Purcell
|
Michael Desmond
|
Qian Pan
|
Inge Vejsbjerg
|
Elizabeth M. Daly
|
Michael Hind
|
Werner Geyer
|
Ambrish Rawat
|
Kush R. Varshney
|
Prasanna Sattigeri
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)
The deployment of language models in real-world applications exposes users to various risks, including hallucinations and harmful or unethical content. These challenges highlight the urgent need for robust safeguards to ensure safe and responsible AI. To address this, we introduce Granite Guardian, a suite of advanced models designed to detect and mitigate risks associated with prompts and responses, enabling seamless integration with any large language model (LLM). Unlike existing open-source solutions, our Granite Guardian models provide comprehensive coverage across a wide range of risk dimensions, including social bias, profanity, violence, sexual content, unethical behavior, jailbreaking, and hallucination-related issues such as context relevance, groundedness, and answer accuracy in retrieval-augmented generation (RAG) scenarios. Trained on a unique dataset combining diverse human annotations and synthetic data, Granite Guardian excels in identifying risks often overlooked by traditional detection systems, particularly jailbreak attempts and RAG-specific challenges. https://github.com/ibm-granite/granite-guardian
2024
pdf
bib
abs
Human-Centered Design Recommendations for LLM-as-a-judge
Qian Pan
|
Zahra Ashktorab
|
Michael Desmond
|
Martín Santillán Cooper
|
James Johnson
|
Rahul Nair
|
Elizabeth Daly
|
Werner Geyer
Proceedings of the 1st Human-Centered Large Language Modeling Workshop
Traditional reference-based metrics, such as BLEU and ROUGE, are less effective for assessing outputs from Large Language Models (LLMs) that produce highly creative or superior-quality text, or in situations where reference outputs are unavailable. While human evaluation remains an option, it is costly and difficult to scale. Recent work using LLMs as evaluators (LLM-as-a-judge) is promising, but trust and reliability remain a significant concern. Integrating human input is crucial to ensure criteria used to evaluate are aligned with the human’s intent, and evaluations are robust and consistent. This paper presents a user study of a design exploration called EvaluLLM, that enables users to leverage LLMs as customizable judges, promoting human involvement to balance trust and cost-saving potential with caution. Through interviews with eight domain experts, we identified the need for assistance in developing effective evaluation criteria aligning the LLM-as-a-judge with practitioners’ preferences and expectations. We offer findings and design recommendations to optimize human-assisted LLM-as-judge systems.