SciXGen: A Scientific Paper Dataset for Context-Aware Text Generation

Hong Chen, Hiroya Takamura, Hideki Nakayama


Abstract
Generating text in scientific papers requires not only capturing the content of the given input but also, frequently, acquiring external information, called context. We advance scientific text generation by proposing a new task, context-aware text generation in the scientific domain, which aims to exploit the contributions of context in generated texts. To this end, we present a novel, challenging, large-scale Scientific Paper Dataset for ConteXt-Aware Text Generation (SciXGen), consisting of 205,304 well-annotated papers with full references to widely used objects (e.g., tables, figures, algorithms) in a paper. Using state-of-the-art models, we comprehensively benchmark the efficacy of our newly constructed SciXGen dataset for generating descriptions and paragraphs. Our dataset and benchmarks will be made publicly available in the hope of facilitating research on scientific text generation.
Anthology ID:
2021.findings-emnlp.128
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2021
Month:
November
Year:
2021
Address:
Punta Cana, Dominican Republic
Venues:
EMNLP | Findings
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
1483–1492
URL:
https://aclanthology.org/2021.findings-emnlp.128
DOI:
10.18653/v1/2021.findings-emnlp.128
Cite (ACL):
Hong Chen, Hiroya Takamura, and Hideki Nakayama. 2021. SciXGen: A Scientific Paper Dataset for Context-Aware Text Generation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1483–1492, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
SciXGen: A Scientific Paper Dataset for Context-Aware Text Generation (Chen et al., Findings 2021)
PDF:
https://preview.aclanthology.org/update-css-js/2021.findings-emnlp.128.pdf
Data:
S2ORC, unarXive