@inproceedings{phang-etal-2022-adversarially,
    title = "Adversarially Constructed Evaluation Sets Are More Challenging, but May Not Be Fair",
    author = "Phang, Jason  and
      Chen, Angelica  and
      Huang, William  and
      Bowman, Samuel R.",
    editor = "Bartolo, Max  and
      Kirk, Hannah  and
      Rodriguez, Pedro  and
      Margatina, Katerina  and
      Thrush, Tristan  and
      Jia, Robin  and
      Stenetorp, Pontus  and
      Williams, Adina  and
      Kiela, Douwe",
    booktitle = "Proceedings of the First Workshop on Dynamic Adversarial Data Collection",
    month = jul,
    year = "2022",
    address = "Seattle, WA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.dadc-1.8/",
    doi = "10.18653/v1/2022.dadc-1.8",
    pages = "62--62",
    abstract = "Large language models increasingly saturate existing task benchmarks, in some cases outperforming humans, leaving little headroom with which to measure further progress. Adversarial dataset creation, which builds datasets using examples that a target system outputs incorrect predictions for, has been proposed as a strategy to construct more challenging datasets, avoiding the more serious challenge of building more precise benchmarks by conventional means. In this work, we study the impact of applying three common approaches for adversarial dataset creation: (1) filtering out easy examples (AFLite), (2) perturbing examples (TextFooler), and (3) model-in-the-loop data collection (ANLI and AdversarialQA), across 18 different adversary models. We find that all three methods can produce more challenging datasets, with stronger adversary models lowering the performance of evaluated models more. However, the resulting ranking of the evaluated models can also be unstable and highly sensitive to the choice of adversary model. Moreover, we find that AFLite oversamples examples with low annotator agreement, meaning that model comparisons hinge on the examples that are most contentious for humans. We recommend that researchers tread carefully when using adversarial methods for building evaluation datasets."
}