Abstract
In natural language processing, multi-dataset benchmarks for common tasks (e.g., SuperGLUE for natural language inference and MRQA for question answering) have risen in importance. Invariably, tasks and individual examples vary in difficulty. Recent analysis methods infer properties of examples such as difficulty. In particular, Item Response Theory (IRT) jointly infers example and model properties from the output of benchmark tasks (i.e., scores for each model-example pair). Therefore, it seems sensible that methods like IRT should be able to detect differences between datasets in a task. This work shows that current IRT models are not as good at identifying differences as we would expect, explains why this is difficult, and outlines future directions that incorporate more (textual) signal from examples.
- Anthology ID:
- 2022.insights-1.14
- Volume:
- Proceedings of the Third Workshop on Insights from Negative Results in NLP
- Month:
- May
- Year:
- 2022
- Address:
- Dublin, Ireland
- Editors:
- Shabnam Tafreshi, João Sedoc, Anna Rogers, Aleksandr Drozd, Anna Rumshisky, Arjun Akula
- Venue:
- insights
- Publisher:
- Association for Computational Linguistics
- Pages:
- 100–112
- URL:
- https://aclanthology.org/2022.insights-1.14
- DOI:
- 10.18653/v1/2022.insights-1.14
- Cite (ACL):
- Pedro Rodriguez, Phu Mon Htut, John Lalor, and João Sedoc. 2022. Clustering Examples in Multi-Dataset Benchmarks with Item Response Theory. In Proceedings of the Third Workshop on Insights from Negative Results in NLP, pages 100–112, Dublin, Ireland. Association for Computational Linguistics.
- Cite (Informal):
- Clustering Examples in Multi-Dataset Benchmarks with Item Response Theory (Rodriguez et al., insights 2022)
- PDF:
- https://preview.aclanthology.org/fix-dup-bibkey/2022.insights-1.14.pdf
- Data
- DynaSent, MRQA, SST, SuperGLUE
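The abstract describes IRT as jointly inferring example properties (difficulty) and model properties (ability) from a matrix of per-model, per-example scores. A minimal sketch of that idea, assuming a one-parameter logistic (Rasch) model fit by gradient ascent on synthetic data — this is an illustration only, not the paper's implementation (which would typically use a Bayesian IRT library):

```python
# Sketch of a Rasch IRT model: model i has ability theta_i, example j has
# difficulty b_j, and P(model i answers example j correctly) =
# sigmoid(theta_i - b_j). All data here is synthetic.
import numpy as np

rng = np.random.default_rng(0)
n_models, n_items = 20, 50
true_theta = rng.normal(0, 1, n_models)   # latent model abilities
true_b = rng.normal(0, 1, n_items)        # latent example difficulties

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Simulate the benchmark output: a binary score for each model-example pair.
probs = sigmoid(true_theta[:, None] - true_b[None, :])
scores = (rng.random((n_models, n_items)) < probs).astype(float)

# Jointly infer theta and b by gradient ascent on the Bernoulli log-likelihood.
theta = np.zeros(n_models)
b = np.zeros(n_items)
lr = 0.01
for _ in range(1000):
    p = sigmoid(theta[:, None] - b[None, :])
    resid = scores - p                 # gradient factor of the log-likelihood
    theta += lr * resid.sum(axis=1)
    b -= lr * resid.sum(axis=0)
    b -= b.mean()                      # center difficulties for identifiability

# Recovered difficulties should correlate with the true ones.
corr = np.corrcoef(b, true_b)[0, 1]
```

Given such fitted difficulties, one could then ask the paper's question: do examples from different constituent datasets (e.g., the SuperGLUE or MRQA sub-tasks) occupy distinguishable regions of the inferred difficulty space?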