Massively Multilingual Instruction-Following Information Extraction

Thang Le, Huy Huu Nguyen, Anh Tuan Luu, Thien Huu Nguyen


Abstract
The literature on information extraction (IE) has mostly centered on a select few languages, hindering the application of IE methods to multilingual corpora. In this work, we introduce MASSIE, a comprehensive collection for instruction-following multilingual IE that standardizes and unifies 215 manually annotated datasets covering 96 typologically diverse languages from 18 language families. Based on MASSIE, we conduct empirical studies on few-shot in-context learning with 21 LLMs ranging from 0.5B to 72B parameters and report the factors that positively or negatively affect their performance in multilingual IE. Additionally, we introduce LF1, a structure-aware metric that credits partially matched spans, addressing the conservativeness of the standard exact-matching scheme, which over-penalizes LLMs' predictions. Overall, our results indicate that multilingual IE remains very challenging for existing LLMs, especially on complex tasks involving relations and events. Moreover, the performance gap between high- and low-performing languages is extremely large, yet the groups of similarly performing languages largely overlap across different LLMs, suggesting a shared performance bias in current LLMs.
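
To illustrate the contrast the abstract draws between exact matching and partial-span matching, below is a minimal sketch of a lenient span-level F1 that gives credit for partial overlaps. This is an assumption-based illustration of the general idea, not the paper's LF1 definition; the function names (`span_overlap`, `partial_match_f1`) and the token-overlap scoring are hypothetical.

```python
# Illustrative sketch only: a lenient span-level F1 that credits partial
# overlaps via token overlap. This is NOT the paper's LF1 definition.

def span_overlap(pred: str, gold: str) -> float:
    """Soft overlap score between a predicted and a gold span (0.0-1.0)."""
    pred_tokens, gold_tokens = pred.split(), gold.split()
    if not pred_tokens or not gold_tokens:
        return 0.0
    common = len(set(pred_tokens) & set(gold_tokens))
    p = common / len(pred_tokens)   # overlap relative to the prediction
    r = common / len(gold_tokens)   # overlap relative to the gold span
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


def partial_match_f1(pred_spans: list[str], gold_spans: list[str]) -> float:
    """Greedily match predictions to gold spans and score them softly."""
    matched_gold, total_credit = set(), 0.0
    for pred in pred_spans:
        best_j, best_score = None, 0.0
        for j, gold in enumerate(gold_spans):
            if j in matched_gold:
                continue
            score = span_overlap(pred, gold)
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None:
            matched_gold.add(best_j)
            total_credit += best_score
    precision = total_credit / len(pred_spans) if pred_spans else 0.0
    recall = total_credit / len(gold_spans) if gold_spans else 0.0
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)


# Exact matching would score this prediction 0; partial matching gives 0.8.
print(partial_match_f1(["Barack Obama"], ["President Barack Obama"]))
```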
Anthology ID:
2025.findings-acl.182
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
3542–3585
URL:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.182/
Cite (ACL):
Thang Le, Huy Huu Nguyen, Anh Tuan Luu, and Thien Huu Nguyen. 2025. Massively Multilingual Instruction-Following Information Extraction. In Findings of the Association for Computational Linguistics: ACL 2025, pages 3542–3585, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Massively Multilingual Instruction-Following Information Extraction (Le et al., Findings 2025)
PDF:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.182.pdf