INMT-Lite: Accelerating Low-Resource Language Data Collection via Offline Interactive Neural Machine Translation

Harshita Diddee, Anurag Shukla, Tanuja Ganu, Vivek Seshadri, Sandipan Dandapat, Monojit Choudhury, Kalika Bali


Abstract
A steady increase in the performance of Massively Multilingual Models (MMLMs) has contributed to their rapidly increasing use in data collection pipelines. Interactive Neural Machine Translation (INMT) systems are one class of tools that can utilize MMLMs to promote such data collection in several under-resourced languages. However, these tools are often not adapted to the deployment constraints that native language speakers operate in, as bloated, online inference-oriented MMLMs trained for data-rich languages, drive them. INMT-Lite addresses these challenges through its support of (1) three different modes of Internet-independent deployment and (2) a suite of four assistive interfaces suitable for (3) data-sparse languages. We perform an extensive user study for INMT-Lite with an under-resourced language community, Gondi, to find that INMT-Lite improves the data generation experience of community members along multiple axes, such as cognitive load, task productivity, and interface interaction time and effort, without compromising on the quality of the generated translations.INMT-Lite’s code is open-sourced to further research in this domain.
Anthology ID:
2024.lrec-main.797
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
SIG:
Publisher:
ELRA and ICCL
Note:
Pages:
9097–9109
Language:
URL:
https://aclanthology.org/2024.lrec-main.797
DOI:
Bibkey:
Cite (ACL):
Harshita Diddee, Anurag Shukla, Tanuja Ganu, Vivek Seshadri, Sandipan Dandapat, Monojit Choudhury, and Kalika Bali. 2024. INMT-Lite: Accelerating Low-Resource Language Data Collection via Offline Interactive Neural Machine Translation. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 9097–9109, Torino, Italia. ELRA and ICCL.
Cite (Informal):
INMT-Lite: Accelerating Low-Resource Language Data Collection via Offline Interactive Neural Machine Translation (Diddee et al., LREC-COLING 2024)
Copy Citation:
PDF:
https://preview.aclanthology.org/nschneid-patch-4/2024.lrec-main.797.pdf
Optional supplementary material:
 2024.lrec-main.797.OptionalSupplementaryMaterial.zip