Automatic Detection and Language Identification of Multilingual Documents

Marco Lui; Jey Han Lau; Timothy Baldwin

doi:10.1162/tacl_a_00163

Abstract

Language identification is the task of automatically detecting the language(s) present in a document based on the content of the document. In this work, we address the problem of detecting documents that contain text from more than one language (multilingual documents). We introduce a method that is able to detect that a document is multilingual, identify the languages present, and estimate their relative proportions. We demonstrate the effectiveness of our method over synthetic data, as well as real-world multilingual documents collected from the web.

Anthology ID: Q14-1003
Volume: Transactions of the Association for Computational Linguistics, Volume 2
Year: 2014
Address: Cambridge, MA
Editors: Dekang Lin, Michael Collins, Lillian Lee
Venue: TACL
Publisher: MIT Press
Pages: 27–40
URL: https://aclanthology.org/Q14-1003
DOI: 10.1162/tacl_a_00163
Cite (ACL): Marco Lui, Jey Han Lau, and Timothy Baldwin. 2014. Automatic Detection and Language Identification of Multilingual Documents. Transactions of the Association for Computational Linguistics, 2:27–40.
Cite (Informal): Automatic Detection and Language Identification of Multilingual Documents (Lui et al., TACL 2014)
PDF: https://preview.aclanthology.org/nschneid-patch-3/Q14-1003.pdf
