Abstract
Word embeddings, in their different shapes and iterations, have changed the natural language processing research landscape in recent years. The biomedical text processing field is no stranger to this revolution; however, scholars in the field have largely trained their embeddings on scientific documents only, even when working on user-generated data. In this paper we show how training embeddings on a corpus of user-generated text collected from medical forums heavily influences the performance on downstream tasks, outperforming embeddings trained either on general-purpose data or on scientific papers when applied to user-generated content.
- Anthology ID:
- D19-6205
- Volume:
- Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019)
- Month:
- November
- Year:
- 2019
- Address:
- Hong Kong
- Editors:
- Eben Holderness, Antonio Jimeno Yepes, Alberto Lavelli, Anne-Lyse Minard, James Pustejovsky, Fabio Rinaldi
- Venue:
- Louhi
- Publisher:
- Association for Computational Linguistics
- Pages:
- 34–38
- URL:
- https://aclanthology.org/D19-6205
- DOI:
- 10.18653/v1/D19-6205
- Cite (ACL):
- Marco Basaldella and Nigel Collier. 2019. BioReddit: Word Embeddings for User-Generated Biomedical NLP. In Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019), pages 34–38, Hong Kong. Association for Computational Linguistics.
- Cite (Informal):
- BioReddit: Word Embeddings for User-Generated Biomedical NLP (Basaldella & Collier, Louhi 2019)
- PDF:
- https://preview.aclanthology.org/ingest-bitext-workshop/D19-6205.pdf