ACAData: Parallel Dataset of Academic Data for Machine Translation

Iñaki Lacunza; Javier García Gilabert; Francesca De Luca Fornaciari; Javier Aula-Blasco; Aitor González-Agirre; Maite Melero; Marta Villegas

Abstract

We present ACAData, a high-quality parallel dataset for academic translation that consists of two subsets: ACAD-Train, which contains approximately 1.5 million human-generated paragraph pairs across 12 languages, and ACAD-Bench, a curated evaluation set of almost 6,000 translations covering 12 directions. To validate its usefulness, we fine-tune two Large Language Models (LLMs) on ACAD-Train and benchmark them on ACAD-Bench against specialized machine-translation systems, general-purpose open-weight LLMs, and several large-scale proprietary models. Experimental results show that fine-tuning on ACAD-Train improves academic translation quality by +6.1 and +12.4 d-BLEU points on average for the 7B and 2B models, respectively, while also improving long-context translation in the general domain by up to 24.9% when translating out of English. The top-performing fine-tuned model surpasses the best proprietary and open-weight models in the academic translation domain. By releasing ACAD-Train, ACAD-Bench, and the fine-tuned models, we provide the community with a valuable resource to advance research in academic-domain and long-context translation.

Anthology ID: 2026.lrec-main.671
Volume: Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month: May
Year: 2026
Address: Palma de Mallorca, Spain
Editors: Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue: LREC
Publisher: ELRA Language Resource Association
Pages: 8498–8519
URL: https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.671/
Cite (ACL): Iñaki Lacunza, Javier Garcia Gilabert, Francesca De Luca Fornaciari, Javier Aula-Blasco, Aitor Gonzalez-Agirre, Maite Melero, and Marta Villegas. 2026. ACAData: Parallel Dataset of Academic Data for Machine Translation. International Conference on Language Resources and Evaluation, main:8498–8519.
Cite (Informal): ACAData: Parallel Dataset of Academic Data for Machine Translation (Lacunza et al., LREC 2026)
PDF: https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.671.pdf
