Abstract
Part-of-speech tagging is a fundamental NLP task often regarded as solved for high-resource languages such as English. Current state-of-the-art models have achieved high accuracy, especially on the news domain. However, when these models are applied to other corpora with different genres, and especially user-generated data from the Web, we see substantial drops in performance. In this work, we study how a state-of-the-art tagging model trained on different genres performs on Web content from unfiltered Reddit forum discussions. We report the results when training on different splits of the data, tested on Reddit. Our results show that even small amounts of in-domain data can outperform the contribution of data an order of magnitude larger coming from other Web domains. To make progress on out-of-domain tagging, we also evaluate an ensemble approach using multiple single-genre taggers as input features to a meta-classifier. We present state-of-the-art performance on tagging Reddit data, as well as error analysis of the results of these models, and offer a typology of the most common error types among them, broken down by training corpus.
- Anthology ID:
- 2020.wac-1.7
- Volume:
- Proceedings of the 12th Web as Corpus Workshop
- Month:
- May
- Year:
- 2020
- Address:
- Marseille, France
- Editors:
- Adrien Barbaresi, Felix Bildhauer, Roland Schäfer, Egon Stemle
- Venue:
- WAC
- Publisher:
- European Language Resources Association
- Pages:
- 50–56
- Language:
- English
- URL:
- https://preview.aclanthology.org/icon-24-ingestion/2020.wac-1.7/
- Cite (ACL):
- Shabnam Behzad and Amir Zeldes. 2020. A Cross-Genre Ensemble Approach to Robust Reddit Part of Speech Tagging. In Proceedings of the 12th Web as Corpus Workshop, pages 50–56, Marseille, France. European Language Resources Association.
- Cite (Informal):
- A Cross-Genre Ensemble Approach to Robust Reddit Part of Speech Tagging (Behzad & Zeldes, WAC 2020)
- PDF:
- https://preview.aclanthology.org/icon-24-ingestion/2020.wac-1.7.pdf
- Code
- shabnam-b/reddit-pos-ensemble
- Data
- English Web Treebank, GUM
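
The ensemble idea mentioned in the abstract, in which single-genre taggers supply input features to a meta-classifier, can be illustrated with a toy sketch. This is not the authors' implementation: the lookup-based taggers, the lexicons, the training sentences, and the majority-vote meta-classifier below are all hypothetical stand-ins chosen so the example runs with the Python standard library alone. The abstract does not specify the actual models or features used.

```python
from collections import Counter, defaultdict


def make_lookup_tagger(lexicon, default="NOUN"):
    """Build a toy tagger (hypothetical stand-in for a trained single-genre model)."""
    def tag(tokens):
        return [lexicon.get(tok.lower(), default) for tok in tokens]
    return tag


# Two made-up "single-genre" taggers that disagree on informal tokens.
news_tagger = make_lookup_tagger({"the": "DET", "like": "VERB", "lol": "NOUN"})
web_tagger = make_lookup_tagger({"the": "DET", "like": "ADP", "lol": "INTJ"})
base_taggers = [news_tagger, web_tagger]


def features(tokens):
    """One feature tuple per token: the tag predicted by each base tagger."""
    per_tagger = [tagger(tokens) for tagger in base_taggers]
    return list(zip(*per_tagger))


def train_meta(sentences):
    """Learn the most frequent gold tag for each combination of base predictions."""
    counts = defaultdict(Counter)
    for tokens, gold in sentences:
        for feat, tag in zip(features(tokens), gold):
            counts[feat][tag] += 1
    return {feat: c.most_common(1)[0][0] for feat, c in counts.items()}


def predict(meta, tokens):
    """Tag tokens; fall back to the first base tagger for unseen combinations."""
    return [meta.get(feat, feat[0]) for feat in features(tokens)]


# Tiny in-domain training set, invented for illustration only.
train = [
    (["lol", "the", "cat"], ["INTJ", "DET", "NOUN"]),
    (["lol", "like", "the", "dog"], ["INTJ", "ADP", "DET", "NOUN"]),
]
meta = train_meta(train)
print(predict(meta, ["lol", "the", "bird"]))
```

The meta-classifier here is a simple table from base-tagger outputs to the most frequent gold tag. A realistic system would more likely feed the base predictions, possibly alongside token-level features, into a trained classifier.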