Abstract
Generating text for scientific papers requires not only capturing the content of the given input but also, frequently, drawing on external information, called context. We advance scientific text generation by proposing a new task, context-aware text generation in the scientific domain, which aims to exploit the contribution of context to the generated text. To this end, we present a novel and challenging large-scale Scientific Paper Dataset for ConteXt-Aware Text Generation (SciXGen), consisting of 205,304 well-annotated papers with full references to widely used objects (e.g., tables, figures, algorithms) in a paper. Using state-of-the-art models, we comprehensively benchmark the efficacy of our newly constructed SciXGen dataset in generating descriptions and paragraphs. Our dataset and benchmarks will be made publicly available to facilitate research on scientific text generation.
- Anthology ID:
- 2021.findings-emnlp.128
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2021
- Month:
- November
- Year:
- 2021
- Address:
- Punta Cana, Dominican Republic
- Editors:
- Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
- Venue:
- Findings
- SIG:
- SIGDAT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1483–1492
- URL:
- https://aclanthology.org/2021.findings-emnlp.128
- DOI:
- 10.18653/v1/2021.findings-emnlp.128
- Cite (ACL):
- Hong Chen, Hiroya Takamura, and Hideki Nakayama. 2021. SciXGen: A Scientific Paper Dataset for Context-Aware Text Generation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1483–1492, Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Cite (Informal):
- SciXGen: A Scientific Paper Dataset for Context-Aware Text Generation (Chen et al., Findings 2021)
- PDF:
- https://preview.aclanthology.org/naacl-24-ws-corrections/2021.findings-emnlp.128.pdf
- Data
- S2ORC, unarXive