Abstract
Traditional supervised text classifiers require a large number of manually labeled documents, which are often expensive to obtain. Recently, dataless text classification has attracted growing attention, since it requires only a few seed words per category, which are much cheaper to provide. In this paper, we develop a pseudo-label based dataless Naive Bayes (PL-DNB) classifier with seed words. We initialize pseudo-labels for each document using seed word occurrences, and employ the expectation maximization algorithm to train PL-DNB in a semi-supervised manner. The pseudo-labels are iteratively updated using a mixture of seed word occurrences and estimates of label posteriors. To avoid noisy pseudo-labels, we also incorporate information from the nearest neighboring documents in the pseudo-label update step, i.e., we preserve the local neighborhood structure of documents. We empirically show that PL-DNB outperforms traditional dataless text classification algorithms with seed words. In particular, PL-DNB performs well on the imbalanced dataset.
- Anthology ID:
- C18-1162
- Volume:
- Proceedings of the 27th International Conference on Computational Linguistics
- Month:
- August
- Year:
- 2018
- Address:
- Santa Fe, New Mexico, USA
- Editors:
- Emily M. Bender, Leon Derczynski, Pierre Isabelle
- Venue:
- COLING
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1908–1917
- URL:
- https://preview.aclanthology.org/add_missing_videos/C18-1162/
- Cite (ACL):
- Ximing Li and Bo Yang. 2018. A Pseudo Label based Dataless Naive Bayes Algorithm for Text Classification with Seed Words. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1908–1917, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
- Cite (Informal):
- A Pseudo Label based Dataless Naive Bayes Algorithm for Text Classification with Seed Words (Li & Yang, COLING 2018)
- PDF:
- https://preview.aclanthology.org/add_missing_videos/C18-1162.pdf
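
The training loop described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact update rule, so the fixed mixing weight `mix`, the cosine-similarity neighbor search, the equal-weight neighbor averaging, and all function names (`seed_scores`, `knn_average`, `pl_dnb_sketch`) are assumptions made for illustration only. It shows the general shape of the idea: seed-word pseudo-labels, EM over a multinomial Naive Bayes model, and a neighbor-smoothed pseudo-label update.

```python
import numpy as np

def seed_scores(X, seed_ids):
    """Row-normalized seed-word counts per category (uniform if no seed word occurs)."""
    K = len(seed_ids)
    S = np.stack([X[:, ids].sum(axis=1) for ids in seed_ids], axis=1).astype(float)
    totals = S.sum(axis=1, keepdims=True)
    return np.where(totals > 0, S / np.maximum(totals, 1e-12), 1.0 / K)

def knn_average(X, Q, k):
    """Average each document's pseudo-label with those of its k nearest neighbors.
    Cosine similarity and the 50/50 weighting are assumptions of this sketch."""
    Xn = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    sim = Xn @ Xn.T
    np.fill_diagonal(sim, -np.inf)            # exclude the document itself
    nbrs = np.argsort(-sim, axis=1)[:, :k]    # indices of the k most similar docs
    return 0.5 * Q + 0.5 * Q[nbrs].mean(axis=1)

def pl_dnb_sketch(X, seed_ids, n_iter=20, mix=0.5, k=2, alpha=1.0):
    """Hypothetical pseudo-label Naive Bayes loop; returns hard label predictions."""
    S = seed_scores(X, seed_ids)
    Q = S.copy()                              # initial pseudo-labels from seed words
    for _ in range(n_iter):
        # M-step: smoothed multinomial Naive Bayes parameters from soft pseudo-labels
        prior = Q.sum(axis=0) + alpha
        prior /= prior.sum()
        word = Q.T @ X + alpha                # shape (K, V)
        word /= word.sum(axis=1, keepdims=True)
        # E-step: label posteriors under the current parameters
        log_post = np.log(prior) + X @ np.log(word).T
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        # Pseudo-label update: mix seed evidence with posteriors, then smooth over neighbors
        Q = knn_average(X, mix * S + (1.0 - mix) * post, k)
    return post.argmax(axis=1)

# Toy corpus (made up for this sketch). Vocabulary: ball, goal, team, vote, law, party
X = np.array([
    [1, 1, 1, 0, 0, 0],
    [0, 2, 1, 0, 0, 0],   # contains no seed word
    [1, 0, 2, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 2, 1],   # contains no seed word
    [0, 0, 0, 1, 0, 2],
], dtype=float)
seed_ids = [[0], [3]]     # category 0 seed: "ball"; category 1 seed: "vote"
pred = pl_dnb_sketch(X, seed_ids)
print(pred)
```

In this toy setup, the two documents without any seed word start from uniform pseudo-labels and can only be assigned through the learned word distributions and their neighbors, which is the situation the neighborhood-preserving update is meant to help with.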