Language and Encoding Scheme Identification of Extremely Large Sets of Multilingual Text

Pavol Zavarsky, Yoshiki Mikami, Shota Wada


Abstract
This paper outlines our approach to identifying languages and encoding schemes in extremely large sets of multilingual documents. The sets we analyze in our Language Observatory project [1] consist of tens of millions of text documents. Our approach allows us to analyze about 250 documents per second (about 20 million documents/day) on a single Linux machine. Using multithreaded processing on a cluster of Linux servers, we can easily analyze more than 100 million documents/day.
Anthology ID:
2005.mtsummit-posters.5
Volume:
Proceedings of Machine Translation Summit X: Posters
Month:
September 13-15
Year:
2005
Address:
Phuket, Thailand
Venue:
MTSummit
Pages:
354–355
URL:
https://aclanthology.org/2005.mtsummit-posters.5
Cite (ACL):
Pavol Zavarsky, Yoshiki Mikami, and Shota Wada. 2005. Language and Encoding Scheme Identification of Extremely Large Sets of Multilingual Text. In Proceedings of Machine Translation Summit X: Posters, pages 354–355, Phuket, Thailand.
Cite (Informal):
Language and Encoding Scheme Identification of Extremely Large Sets of Multilingual Text (Zavarsky et al., MTSummit 2005)
PDF:
https://preview.aclanthology.org/nschneid-patch-5/2005.mtsummit-posters.5.pdf