Abstract
Technical documents contain a fair amount of unnatural language, such as tables, formulas, and pseudo-code. Unnatural language can be an important factor in confusing existing NLP tools. This paper presents an effective method of distinguishing unnatural language from natural language, and evaluates the impact of unnatural language detection on NLP tasks such as document clustering. We view this problem as an information extraction task and build a multiclass classification model that classifies unnatural language components into four categories. First, we create a new annotated corpus by collecting slides and papers in various formats (PPT, PDF, and HTML), where unnatural language components are annotated into four categories. We then explore features available from plain text to build a statistical model that can handle any format as long as it is converted into plain text. Our experiments show that removing unnatural language components gives an absolute improvement in document clustering of up to 15%. Our corpus and tool are publicly available.
- Anthology ID:
- W17-4416
- Volume:
- Proceedings of the 3rd Workshop on Noisy User-generated Text
- Month:
- September
- Year:
- 2017
- Address:
- Copenhagen, Denmark
- Venue:
- WNUT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 122–130
- URL:
- https://aclanthology.org/W17-4416
- DOI:
- 10.18653/v1/W17-4416
- Cite (ACL):
- Myungha Jang, Jinho D. Choi, and James Allan. 2017. Improving Document Clustering by Removing Unnatural Language. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 122–130, Copenhagen, Denmark. Association for Computational Linguistics.
- Cite (Informal):
- Improving Document Clustering by Removing Unnatural Language (Jang et al., WNUT 2017)
- PDF:
- https://preview.aclanthology.org/remove-xml-comments/W17-4416.pdf