Nikita Martynov


2022

pdf
RuPAWS: A Russian Adversarial Dataset for Paraphrase Identification
Nikita Martynov | Irina Krotova | Varvara Logacheva | Alexander Panchenko | Olga Kozlova | Nikita Semenov
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Paraphrase identification task can be easily challenged by changing word order, e.g. as in “Can a good person become bad?”. While for English this problem was tackled by the PAWS dataset (Zhang et al., 2019), datasets for Russian paraphrase detection lack non-paraphrase examples with high lexical overlap. We present RuPAWS, the first adversarial dataset for Russian paraphrase identification. Our dataset consists of examples from PAWS translated to the Russian language and manually annotated by native speakers. We compare it to the largest available dataset for Russian ParaPhraser and show that the best available paraphrase identifiers for the Russian language fail on the RuPAWS dataset. At the same time, the state-of-the-art paraphrasing model RuBERT trained on both RuPAWS and ParaPhraser obtains high performance on the RuPAWS dataset while maintaining its accuracy on the ParaPhraser benchmark. We also show that RuPAWS can measure the sensitivity of models to word order and syntax structure since simple baselines fail even when given RuPAWS training samples.