Abstract
Sharing datasets and benchmarks has been crucial for rapidly improving Natural Language Processing models and systems. Documenting datasets’ characteristics (and any modification introduced over time) is equally important to avoid confusion and make comparisons reliable. Here, we describe the case of BigPatent, a dataset for patent summarization that exists in at least two rather different versions under the same name. While previous literature has not clearly distinguished among versions, their differences do not only lie on a surface level but also modify the dataset’s core nature and, thus, the complexity of the summarization task. While this paper describes a specific case, we aim to shed light on new challenges that might emerge in resource sharing and advocate for comprehensive documentation of datasets and models.
- Anthology ID:
- 2022.gem-1.34
- Volume:
- Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)
- Month:
- December
- Year:
- 2022
- Address:
- Abu Dhabi, United Arab Emirates (Hybrid)
- Venue:
- GEM
- SIG:
- SIGGEN
- Publisher:
- Association for Computational Linguistics
- Pages:
- 399–404
- URL:
- https://aclanthology.org/2022.gem-1.34
- Cite (ACL):
- Silvia Casola, Alberto Lavelli, and Horacio Saggion. 2022. What’s in a (dataset’s) name? The case of BigPatent. In Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pages 399–404, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
- Cite (Informal):
- What’s in a (dataset’s) name? The case of BigPatent (Casola et al., GEM 2022)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2022.gem-1.34.pdf