Translating User-Generated Content in the Social Networking Space

Jie Jiang, Andy Way, Rejwanul Haque


Abstract
This paper presents a case study of work done by Applied Language Solutions (ALS) for a large social networking provider that claims to have built the world’s first multi-language social network, where Internet users from all over the world can communicate in any of the languages available in the system. In an initial phase, the social networking provider contracted ALS to build Machine Translation (MT) engines for twelve language pairs, covering both directions of Russian⇔English, Russian⇔Turkish, Russian⇔Arabic, Turkish⇔English, Turkish⇔Arabic and Arabic⇔English. All of the input data is user-generated content, so we faced a number of problems in building large-scale, robust, high-quality engines. Primarily, much of the source-language data is of ‘poor’ or at least ‘non-standard’ quality. This comes in many forms: (i) content produced by non-native speakers, (ii) content produced by native speakers containing non-deliberate typos, or (iii) content produced by native speakers which deliberately departs from spelling norms to bring about some linguistic effect. Accordingly, in addition to the ‘regular’ pre-processing techniques used in building our statistical MT systems, we needed to develop routines to deal with all of these scenarios. In this paper, we describe how we handle shortforms, acronyms, typos, punctuation errors, non-dictionary slang, wordplay, censor avoidance and emoticons. We report automatic evaluation scores on the social network data, together with insights from the social networking provider regarding some of the typical errors made by the MT engines, and how we managed to correct these in the engines.
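To make the kind of pre-processing described above more concrete, the following is a minimal sketch of a normalization pass over user-generated text before it reaches an MT engine. It is not the routine used by ALS: the abstract does not specify the implementation, so the lookup table `SHORTFORMS`, the emoticon pattern, the placeholder format and the function name `normalize` are all assumptions made for illustration. It covers only three of the listed phenomena (shortforms, punctuation errors and emoticons) plus character elongation.

```python
import re

# Hypothetical shortform table; the entries are illustrative and not from the paper.
SHORTFORMS = {"u": "you", "r": "are", "gr8": "great", "thx": "thanks"}

# Assumed pattern covering a few common Western emoticons such as :) ;-) :D :(
EMOTICON_RE = re.compile(r"[:;=]-?[()DPp]")

# Runs of identical sentence punctuation, e.g. "!!!" or "??"
PUNCT_RUN_RE = re.compile(r"([!?.,])\1+")

# A letter repeated three or more times, e.g. the "ooo" in "coool"
ELONGATION_RE = re.compile(r"([^\W\d_])\1{2,}")

WORD_RE = re.compile(r"\b\w+\b")


def normalize(text: str) -> str:
    """Normalize noisy text while keeping emoticons intact."""
    emoticons = []

    def stash(match):
        # Swap each emoticon for a placeholder so later steps cannot damage it.
        emoticons.append(match.group(0))
        return f" __EMO{len(emoticons) - 1}__ "

    text = EMOTICON_RE.sub(stash, text)
    # Collapse repeated punctuation to a single mark.
    text = PUNCT_RUN_RE.sub(r"\1", text)
    # Collapse elongated letters to two; a spell checker would be needed to
    # decide between one and two in the general case.
    text = ELONGATION_RE.sub(r"\1\1", text)
    # Expand known shortforms (case-insensitive lookup, lowercase output).
    text = WORD_RE.sub(
        lambda m: SHORTFORMS.get(m.group(0).lower(), m.group(0)), text
    )
    # Put the emoticons back and tidy whitespace.
    for i, emoticon in enumerate(emoticons):
        text = text.replace(f"__EMO{i}__", emoticon)
    return " ".join(text.split())


if __name__ == "__main__":
    print(normalize("thx u r gr8!!! coool :)"))
```

Protecting emoticons with placeholders before any other rewriting is one possible design choice: it keeps them from being mangled by the punctuation rules, and the same placeholders could be passed through the MT engine untranslated.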
Anthology ID:
2012.amta-commercial.8
Volume:
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Commercial MT User Program
Month:
October 28-November 1
Year:
2012
Address:
San Diego, California, USA
Venue:
AMTA
Publisher:
Association for Machine Translation in the Americas
URL:
https://aclanthology.org/2012.amta-commercial.8
Cite (ACL):
Jie Jiang, Andy Way, and Rejwanul Haque. 2012. Translating User-Generated Content in the Social Networking Space. In Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Commercial MT User Program, San Diego, California, USA. Association for Machine Translation in the Americas.
Cite (Informal):
Translating User-Generated Content in the Social Networking Space (Jiang et al., AMTA 2012)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2012.amta-commercial.8.pdf