Adina Yakefu
2026
Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows
Haoyu Dong | Pengkun Zhang | Yan Gao | Xuanyu Dong | Yilin Cheng | Mingzhe Lu | Adina Yakefu | Shuxin Zheng
Findings of the Association for Computational Linguistics: ACL 2026
We introduce FinWorkBench (a.k.a. Finch) for evaluating AI agents on real-world, enterprise-grade finance and accounting workflows that interleave data entry, structuring, formatting, web search, cross-file retrieval, calculation, modeling, validation, translation, visualization, and reporting. Finch is sourced from authentic enterprise workspaces from Enron (15,000 files and 500,000 emails) and other financial institutions, covering the period 2000–2025 and preserving the in-the-wild messiness of multimodal artifacts such as tables and charts across diverse domains including budgeting, trading, asset management, and operational management. We propose a workflow construction process that combines LLM-assisted mining of workflows from authentic enterprise environments with expert annotation: (1) LLM-assisted, expert-verified derivation of workflows from real-world email threads and spreadsheet version histories, and (2) meticulous annotation requiring over 700 hours of expert effort. This yields 172 composite workflows with 384 tasks, involving 1,710 spreadsheets with 27 million cells, along with PDFs and other artifacts, capturing the intrinsically messy, long-horizon, knowledge-intensive, and collaborative nature of real-world enterprise work. We conduct both human and automated evaluations of frontier AI systems, including GPT 5.1, Claude Sonnet/Opus 4.5, Gemini 3 Pro, Grok 4, and Qwen 3 Max. Under human evaluation, GPT 5.1 Pro spends an average of 16.8 minutes per workflow yet passes only 38.4% of workflows. Comprehensive case studies further surface the challenges that real-world enterprise workflows pose for AI agents.
2025
SHADES: Towards a Multilingual Assessment of Stereotypes in Large Language Models
Margaret Mitchell | Giuseppe Attanasio | Ioana Baldini | Miruna Clinciu | Jordan Clive | Pieter Delobelle | Manan Dey | Sil Hamilton | Timm Dill | Jad Doughman | Ritam Dutt | Avijit Ghosh | Jessica Zosa Forde | Carolin Holtermann | Lucie-Aimée Kaffee | Tanmay Laud | Anne Lauscher | Roberto L Lopez-Davila | Maraim Masoud | Nikita Nangia | Anaelia Ovalle | Giada Pistilli | Dragomir Radev | Beatrice Savoldi | Vipul Raheja | Jeremy Qin | Esther Ploeger | Arjun Subramonian | Kaustubh Dhole | Kaiser Sun | Amirbek Djanibekov | Jonibek Mansurov | Kayo Yin | Emilio Villa Cueva | Sagnik Mukherjee | Jerry Huang | Xudong Shen | Jay Gala | Hamdan Al-Ali | Tair Djanibekov | Nurdaulet Mukhituly | Shangrui Nie | Shanya Sharma | Karolina Stanczak | Eliza Szczechla | Tiago Timponi Torrent | Deepak Tunuguntla | Marcelo Viridiano | Oskar Van Der Wal | Adina Yakefu | Aurélie Névéol | Mike Zhang | Sydney Zink | Zeerak Talat
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Large Language Models (LLMs) reproduce and exacerbate the social biases present in their training data, and resources to quantify this issue are limited. While research has attempted to identify and mitigate such biases, most efforts have been concentrated around English, lagging the rapid advancement of LLMs in multilingual settings. In this paper, we introduce a new multilingual parallel dataset SHADES to help address this issue, designed for examining culturally-specific stereotypes that may be learned by LLMs. The dataset includes stereotypes from 20 regions around the world and 16 languages, spanning multiple identity categories subject to discrimination worldwide. We demonstrate its utility in a series of exploratory evaluations for both “base” and “instruction-tuned” language models. Our results suggest that stereotypes are consistently reflected across models and languages, with some languages and models indicating much stronger stereotype biases than others.
Co-authors
- Hamdan Al-Ali 1
- Giuseppe Attanasio 1
- Ioana Baldini 1
- Yilin Cheng 1
- Miruna Clinciu 1
- Jordan Clive 1
- Pieter Delobelle 1
- Manan Dey 1
- Kaustubh Dhole 1
- Timm Dill 1
- Amirbek Djanibekov 1
- Haoyu Dong 1
- Xuanyu Dong 1
- Jad Doughman 1
- Ritam Dutt 1
- Jessica Zosa Forde 1
- Jay Gala 1
- Yan Gao 1
- Avijit Ghosh 1
- Sil Hamilton 1
- Carolin Holtermann 1
- Jerry Huang 1
- Lucie-Aimée Kaffee 1
- Tanmay Laud 1
- Anne Lauscher 1
- Roberto L Lopez-Davila 1
- Mingzhe Lu 1
- Jonibek Mansurov 1
- Maraim Masoud 1
- Margaret Mitchell 1
- Sagnik Mukherjee 1
- Nurdaulet Mukhituly 1
- Nikita Nangia 1
- Aurélie Névéol 1
- Shangrui Nie 1
- Anaelia Ovalle 1
- Giada Pistilli 1
- Esther Ploeger 1
- Jeremy Qin 1
- Dragomir Radev 1
- Vipul Raheja 1
- Beatrice Savoldi 1
- Shanya Sharma 1
- Xudong Shen 1
- Karolina Stanczak 1
- Arjun Subramonian 1
- Kaiser Sun 1
- Eliza Szczechla 1
- Tair Djanibekov 1
- Zeerak Talat 1
- Tiago Timponi Torrent 1
- Deepak Tunuguntla 1
- Oskar Van Der Wal 1
- Emilio Villa-Cueva 1
- Marcelo Viridiano 1
- Kayo Yin 1
- Mike Zhang 1
- Pengkun Zhang 1
- Shuxin Zheng 1
- Sydney Zink 1