Abstract
This paper lays the groundwork for research into Source Language Identification: the task of identifying the original language of a machine-translated text. We contribute a dataset of translations from a typologically diverse spectrum of languages into English and use it to set initial baselines for this novel task.
- Anthology ID:
- 2024.sigtyp-1.8
- Volume:
- Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP
- Month:
- March
- Year:
- 2024
- Address:
- St. Julian's, Malta
- Editors:
- Michael Hahn, Alexey Sorokin, Ritesh Kumar, Andreas Shcherbakov, Yulia Otmakhova, Jinrui Yang, Oleg Serikov, Priya Rani, Edoardo M. Ponti, Saliha Muradoğlu, Rena Gao, Ryan Cotterell, Ekaterina Vylomova
- Venues:
- SIGTYP | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 58–65
- URL:
- https://aclanthology.org/2024.sigtyp-1.8
- Cite (ACL):
- Damiaan Reijnaers and Charlotte Pouw. 2024. GTNC: A Many-To-One Dataset of Google Translations from NewsCrawl. In Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, pages 58–65, St. Julian's, Malta. Association for Computational Linguistics.
- Cite (Informal):
- GTNC: A Many-To-One Dataset of Google Translations from NewsCrawl (Reijnaers & Pouw, SIGTYP-WS 2024)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-1/2024.sigtyp-1.8.pdf