Sami Itkonen
2021
Coping with Noisy Training Data Labels in Paraphrase Detection
Teemu Vahtola
|
Mathias Creutz
|
Eetu Sjöblom
|
Sami Itkonen
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)
We present new state-of-the-art benchmarks for paraphrase detection on all six languages in the Opusparcus sentential paraphrase corpus: English, Finnish, French, German, Russian, and Swedish. We reach these baselines by fine-tuning BERT. The best results are achieved on smaller and cleaner subsets of the training sets than was observed in previous research. Additionally, we study a translation-based approach that is competitive for the languages with more limited and noisier training data.