MASSIVE-Agents: A Benchmark for Multilingual Function-Calling in 52 Languages

Mayank Kulkarni; Vittorio Mazzia; Judith Gaspers; Chris Hench; Jack FitzGerald

doi:10.18653/v1/2025.findings-emnlp.1099

Abstract

We present MASSIVE-Agents, a new benchmark for assessing multilingual function calling across 52 languages. We created MASSIVE-Agents by cleaning the original MASSIVE dataset and then reformatting it for evaluation within the Berkeley Function-Calling Leaderboard (BFCL) framework. The full benchmark comprises 47,020 samples with an average of 904 samples per language, covering 55 different functions and 286 arguments. We benchmarked 21 models using Amazon Bedrock and present the results along with associated analyses. MASSIVE-Agents is challenging, with the top model Nova Premier achieving an average Abstract Syntax Tree (AST) Accuracy of 34.05% across all languages, with performance varying significantly from 57.37% for English to as low as 6.81% for Amharic. Some models, particularly smaller ones, yielded a score of zero for the more difficult languages. Additionally, we provide results from ablations using a custom 1-shot prompt, ablations with prompts translated into different languages, and comparisons based on model latency.
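To make the headline metric concrete, the following is a minimal sketch of an AST-style accuracy computation, under the assumption that each prediction and each reference is a single function call represented as a name plus an argument mapping. It is not the BFCL implementation, whose matching rules are more detailed; the function names `call_matches` and `ast_style_accuracy`, the dictionary layout, and the sample calls (`alarm_set`, `play_music`) are hypothetical and chosen only for illustration.

```python
from typing import Any


def call_matches(predicted: dict[str, Any], gold: dict[str, Any]) -> bool:
    """Return True when the function name and the full argument mapping agree.

    Assumption: exact equality on arguments. A real checker may accept
    several valid values per argument or ignore optional ones.
    """
    return (
        predicted.get("name") == gold.get("name")
        and predicted.get("arguments", {}) == gold.get("arguments", {})
    )


def ast_style_accuracy(
    predictions: list[dict[str, Any]], golds: list[dict[str, Any]]
) -> float:
    """Percentage of samples whose predicted call matches the reference call."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    correct = sum(call_matches(p, g) for p, g in zip(predictions, golds))
    return 100.0 * correct / len(golds)


# Hypothetical samples: the first prediction matches, the second picks
# the right function but the wrong argument value.
golds = [
    {"name": "alarm_set", "arguments": {"time": "07:00"}},
    {"name": "play_music", "arguments": {"artist": "Queen"}},
]
predictions = [
    {"name": "alarm_set", "arguments": {"time": "07:00"}},
    {"name": "play_music", "arguments": {"artist": "Abba"}},
]
score = ast_style_accuracy(predictions, golds)
```

Under this simplified definition, a per-language score is the same computation restricted to that language's samples, and the cross-language figure is the average of those per-language scores.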

Anthology ID: 2025.findings-emnlp.1099
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 20193–20215
URL: https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1099/
DOI: 10.18653/v1/2025.findings-emnlp.1099
Cite (ACL): Mayank Kulkarni, Vittorio Mazzia, Judith Gaspers, Chris Hench, and Jack FitzGerald. 2025. MASSIVE-Agents: A Benchmark for Multilingual Function-Calling in 52 Languages. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 20193–20215, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): MASSIVE-Agents: A Benchmark for Multilingual Function-Calling in 52 Languages (Kulkarni et al., Findings 2025)
PDF: https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1099.pdf
Checklist: 2025.findings-emnlp.1099.checklist.pdf
