Abstract
This paper outlines our approach to identifying languages and encoding schemes in extremely large sets of multilingual documents. The sets we analyze in our Language Observatory project [1] comprise tens of millions of text documents. The approach allows us to analyze about 250 documents per second (about 20 million documents per day) on a single Linux machine. Using multithreaded processing on a cluster of Linux servers, we can easily analyze more than 100 million documents per day.
- Anthology ID:
- 2005.mtsummit-posters.5
- Volume:
- Proceedings of Machine Translation Summit X: Posters
- Month:
- September 13-15
- Year:
- 2005
- Address:
- Phuket, Thailand
- Venue:
- MTSummit
- Pages:
- 354–355
- URL:
- https://aclanthology.org/2005.mtsummit-posters.5
- Cite (ACL):
- Pavol Zavarsky, Yoshiki Mikami, and Shota Wada. 2005. Language and Encoding Scheme Identification of Extremely Large Sets of Multilingual Text. In Proceedings of Machine Translation Summit X: Posters, pages 354–355, Phuket, Thailand.
- Cite (Informal):
- Language and Encoding Scheme Identification of Extremely Large Sets of Multilingual Text (Zavarsky et al., MTSummit 2005)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-5/2005.mtsummit-posters.5.pdf