Feature Hashing for Language and Dialect Identification

Shervin Malmasi; Mark Dras

doi:10.18653/v1/P17-2063

Abstract

We evaluate feature hashing for language identification (LID), a method not previously used for this task. Using a standard dataset, we first show that while feature performance is high, LID data is highly dimensional and mostly sparse (>99.5%) as it includes large vocabularies for many languages; memory requirements grow as languages are added. Next we apply hashing using various hash sizes, demonstrating that there is no performance loss with dimensionality reductions of up to 86%. We also show that using an ensemble of low-dimension hash-based classifiers further boosts performance. Feature hashing is highly useful for LID and holds great promise for future work in this area.
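
As a rough illustration of the hashing idea described in the abstract, the sketch below maps character n-grams of a text into a fixed number of buckets without storing a vocabulary. It is not the authors' implementation: the function names `char_ngrams` and `hash_features`, the use of character trigrams, the CRC32 hash, and the bucket counts are all assumptions chosen for the example.

```python
import zlib
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def hash_features(text, n_buckets=2 ** 18, n=3):
    """Count n-grams into a sparse vector of fixed width `n_buckets`.

    No vocabulary is stored: each n-gram's bucket index is computed
    directly from its hash, so distinct n-grams may collide.
    (Hypothetical helper; CRC32 is an arbitrary choice of hash.)
    """
    counts = Counter()
    for gram in char_ngrams(text, n):
        index = zlib.crc32(gram.encode("utf-8")) % n_buckets
        counts[index] += 1
    return dict(counts)


# Sparse representation: {bucket index: count}, with every index < n_buckets.
features = hash_features("feature hashing", n_buckets=1024)
```

In a sketch like this, the width of the feature space is set by `n_buckets` alone, so it does not grow as new languages and their vocabularies are added. Choosing a smaller `n_buckets` reduces dimensionality at the cost of more collisions between n-grams.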

Anthology ID: P17-2063
Volume: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Month: July
Year: 2017
Address: Vancouver, Canada
Editors: Regina Barzilay, Min-Yen Kan
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 399–403
URL: https://preview.aclanthology.org/jlcl-multiple-ingestion/P17-2063/
DOI: 10.18653/v1/P17-2063
Cite (ACL): Shervin Malmasi and Mark Dras. 2017. Feature Hashing for Language and Dialect Identification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 399–403, Vancouver, Canada. Association for Computational Linguistics.
Cite (Informal): Feature Hashing for Language and Dialect Identification (Malmasi & Dras, ACL 2017)
PDF: https://preview.aclanthology.org/jlcl-multiple-ingestion/P17-2063.pdf
