@inproceedings{carlsson-etal-2021-gandalf,
title = "{GANDALF}: a General Character Name Description Dataset for Long Fiction",
author = "Carlsson, Fredrik and
Sahlgren, Magnus and
Olsson, Fredrik and
Cuba Gyllensten, Amaru",
booktitle = "Proceedings of the 3rd Workshop on Machine Reading for Question Answering",
month = nov,
year = "2021",
address = "Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.mrqa-1.13",
doi = "10.18653/v1/2021.mrqa-1.13",
pages = "119--132",
abstract = "This paper introduces a long-range multiple-choice Question Answering (QA) dataset, based on full-length fiction book texts. The questions are formulated as 10-way multiple-choice questions, where the task is to select the correct character name given a character description, or vice-versa. Each character description is formulated in natural text and often contains information from several sections throughout the book. We provide 20,000 questions created from 10,000 manually annotated descriptions of characters from 177 books containing 152,917 words on average. We address the current discourse regarding dataset bias and leakage by a simple anonymization procedure, which in turn enables interesting probing possibilities. Finally, we show that suitable baseline algorithms perform very poorly on this task, with the book size itself making it non-trivial to attempt a Transformer-based QA solution. This leaves ample room for future improvement, and hints at the need for a completely different type of solution.",
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="carlsson-etal-2021-gandalf">
<titleInfo>
<title>GANDALF: a General Character Name Description Dataset for Long Fiction</title>
</titleInfo>
<name type="personal">
<namePart type="given">Fredrik</namePart>
<namePart type="family">Carlsson</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Magnus</namePart>
<namePart type="family">Sahlgren</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Fredrik</namePart>
<namePart type="family">Olsson</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Amaru</namePart>
<namePart type="family">Cuba Gyllensten</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued encoding="w3cdtf">2021-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 3rd Workshop on Machine Reading for Question Answering</title>
</titleInfo>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Punta Cana, Dominican Republic</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>This paper introduces a long-range multiple-choice Question Answering (QA) dataset, based on full-length fiction book texts. The questions are formulated as 10-way multiple-choice questions, where the task is to select the correct character name given a character description, or vice-versa. Each character description is formulated in natural text and often contains information from several sections throughout the book. We provide 20,000 questions created from 10,000 manually annotated descriptions of characters from 177 books containing 152,917 words on average. We address the current discourse regarding dataset bias and leakage by a simple anonymization procedure, which in turn enables interesting probing possibilities. Finally, we show that suitable baseline algorithms perform very poorly on this task, with the book size itself making it non-trivial to attempt a Transformer-based QA solution. This leaves ample room for future improvement, and hints at the need for a completely different type of solution.</abstract>
<identifier type="citekey">carlsson-etal-2021-gandalf</identifier>
<identifier type="doi">10.18653/v1/2021.mrqa-1.13</identifier>
<location>
<url>https://aclanthology.org/2021.mrqa-1.13</url>
</location>
<part>
<date>2021-11</date>
<extent unit="page">
<start>119</start>
<end>132</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T GANDALF: a General Character Name Description Dataset for Long Fiction
%A Carlsson, Fredrik
%A Sahlgren, Magnus
%A Olsson, Fredrik
%A Cuba Gyllensten, Amaru
%S Proceedings of the 3rd Workshop on Machine Reading for Question Answering
%D 2021
%8 nov
%I Association for Computational Linguistics
%C Punta Cana, Dominican Republic
%F carlsson-etal-2021-gandalf
%X This paper introduces a long-range multiple-choice Question Answering (QA) dataset, based on full-length fiction book texts. The questions are formulated as 10-way multiple-choice questions, where the task is to select the correct character name given a character description, or vice-versa. Each character description is formulated in natural text and often contains information from several sections throughout the book. We provide 20,000 questions created from 10,000 manually annotated descriptions of characters from 177 books containing 152,917 words on average. We address the current discourse regarding dataset bias and leakage by a simple anonymization procedure, which in turn enables interesting probing possibilities. Finally, we show that suitable baseline algorithms perform very poorly on this task, with the book size itself making it non-trivial to attempt a Transformer-based QA solution. This leaves ample room for future improvement, and hints at the need for a completely different type of solution.
%R 10.18653/v1/2021.mrqa-1.13
%U https://aclanthology.org/2021.mrqa-1.13
%U https://doi.org/10.18653/v1/2021.mrqa-1.13
%P 119-132
Markdown (Informal)
[GANDALF: a General Character Name Description Dataset for Long Fiction](https://aclanthology.org/2021.mrqa-1.13) (Carlsson et al., MRQA 2021)
ACL