Pedro Kroll

2026

Environmental, Social, and Governance (ESG) factors are becoming increasingly central to corporate accountability and sustainable development. However, benchmarks for evaluating large language models (LLMs) in this domain remain scarce. To address this gap, we present ESG-QA, a dataset of 87,261 question–answer–context triplets spanning the three ESG pillars. ESG-QA was built using an LLM-based question-answer (QA) generation pipeline, refined through rule-based and semantic filtering, and validated by human inspection. The dataset supports both abstractive QA and retrieval-augmented setups. We benchmark three open-weight LLM families (Llama-3, Gemma-3, and Qwen-3) across multiple dimensions, including correctness, environmental impact, and readability. Results show that Qwen-3 with retrieval achieves the highest absolute QA performance, while Gemma-3 provides the strongest overall balance between correctness, efficiency, and clarity. By releasing ESG-QA and its generation framework, this work provides a benchmark for advancing ESG-oriented QA and for promoting more transparent and responsible AI evaluation.

Co-authors

Darian Rabbani

Ayrton Surica

Venues

LREC
