A Treebank for the Healthcare Domain
Nganthoibi Oinam, Diwakar Mishra, Pinal Patel, Narayan Choudhary, Hitesh Desai
Abstract
This paper presents a treebank for the healthcare domain developed at ezDI. The treebank is created from a wide array of clinical health record documents across hospitals. The data has been de-identified and annotated for constituent syntactic structure. The treebank contains a total of 52053 sentences that have been sampled for subdomains as well as linguistic variations. The paper outlines the sampling process followed to ensure a better domain representation in the corpus, the annotation process and challenges, and corpus statistics. The Penn Treebank tagset and guidelines were largely followed, but there were many syntactic contexts that warranted adaptation of the guidelines. The treebank created was used to re-train the Berkeley parser and the Stanford parser. These parsers were also trained with the GENIA treebank for comparative quality assessment. Our treebank yielded greater accuracy on both parsers. The Berkeley parser performed better on our treebank, with an average F1 measure of 91 across 5 folds. This was a significant jump from the out-of-the-box F1 score of 70 on the Berkeley parser’s default grammar.
- Anthology ID:
- W18-4916
- Volume:
- Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)
- Month:
- August
- Year:
- 2018
- Address:
- Santa Fe, New Mexico, USA
- Editors:
- Agata Savary, Carlos Ramisch, Jena D. Hwang, Nathan Schneider, Melanie Andresen, Sameer Pradhan, Miriam R. L. Petruck
- Venues:
- LAW | MWE
- SIGs:
- SIGANN | SIGLEX
- Publisher:
- Association for Computational Linguistics
- Pages:
- 144–155
- URL:
- https://aclanthology.org/W18-4916
- Cite (ACL):
- Nganthoibi Oinam, Diwakar Mishra, Pinal Patel, Narayan Choudhary, and Hitesh Desai. 2018. A Treebank for the Healthcare Domain. In Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018), pages 144–155, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
- Cite (Informal):
- A Treebank for the Healthcare Domain (Oinam et al., LAW-MWE 2018)
- PDF:
- https://preview.aclanthology.org/fix-dup-bibkey/W18-4916.pdf
- Data
- Penn Treebank