Abstract
While language identification works well on standard texts, it performs much worse on social media language, in particular dialectal language—even for English. First, to support work on English language identification, we contribute a new dataset of tweets annotated for English versus non-English, with attention to ambiguity, code-switching, and automatic generation issues. It is randomly sampled from all public messages, avoiding biases towards pre-existing language classifiers. Second, we find that a demographic language model—which identifies messages with language similar to that used by several U.S. ethnic populations on Twitter—can be used to improve English language identification performance when combined with a traditional supervised language identifier. It increases recall with almost no loss of precision, including, surprisingly, for English messages written by non-U.S. authors. Our dataset and identifier ensemble are available online.

- Anthology ID: W17-4408
- Volume: Proceedings of the 3rd Workshop on Noisy User-generated Text
- Month: September
- Year: 2017
- Address: Copenhagen, Denmark
- Editors: Leon Derczynski, Wei Xu, Alan Ritter, Tim Baldwin
- Venue: WNUT
- Publisher: Association for Computational Linguistics
- Pages: 56–61
- URL: https://aclanthology.org/W17-4408
- DOI: 10.18653/v1/W17-4408
- Cite (ACL): Su Lin Blodgett, Johnny Wei, and Brendan O’Connor. 2017. A Dataset and Classifier for Recognizing Social Media English. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 56–61, Copenhagen, Denmark. Association for Computational Linguistics.
- Cite (Informal): A Dataset and Classifier for Recognizing Social Media English (Blodgett et al., WNUT 2017)
- PDF: https://preview.aclanthology.org/ingest-2024-clasp/W17-4408.pdf
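The ensemble strategy described in the abstract—accepting a tweet as English if either a traditional supervised language identifier or the demographic language model labels it English—can be sketched as follows. This is a minimal illustration only: both classifier bodies below are hypothetical stand-ins (simple word-list checks), not the authors' actual models, and all function names are invented for this sketch.

```python
# Hypothetical sketch: union-of-positives ensemble for English
# language identification. The two classifiers here are toy
# stand-ins, not the models from the paper.

def supervised_langid_is_english(text: str) -> bool:
    """Stand-in for a traditional supervised language identifier
    (e.g. a character n-gram classifier). Here: trivially checks
    for common English function words."""
    english_words = {"the", "and", "is", "you", "to", "a"}
    tokens = text.lower().split()
    return any(t in english_words for t in tokens)

def demographic_model_is_english(text: str) -> bool:
    """Stand-in for the demographic language model, which scores
    similarity to language used by several U.S. ethnic populations
    on Twitter. Here: accepts a few dialectal forms that a
    standard identifier often misses."""
    dialect_markers = {"finna", "tryna", "gon", "yall"}
    tokens = text.lower().split()
    return any(t in dialect_markers for t in tokens)

def ensemble_is_english(text: str) -> bool:
    # Take the union of the two classifiers' positive predictions:
    # per the abstract, this increases recall with almost no loss
    # of precision.
    return (supervised_langid_is_english(text)
            or demographic_model_is_english(text))
```

The union rule can only add positive predictions, which is why it raises recall; the paper's finding is that the demographic model's additions cost almost no precision.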