Daniil Moskovskiy
2022
ParaDetox: Detoxification with Parallel Data
Varvara Logacheva | Daryna Dementieva | Sergey Ustyantsev | Daniil Moskovskiy | David Dale | Irina Krotova | Nikita Semenov | Alexander Panchenko
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We present a novel pipeline for the collection of parallel data for the detoxification task. We collect non-toxic paraphrases for over 10,000 English toxic sentences. We also show that this pipeline can be used to distill a large existing corpus of paraphrases to get toxic-neutral sentence pairs. We release two parallel corpora which can be used for the training of detoxification models. To the best of our knowledge, these are the first parallel datasets for this task. We describe our pipeline in detail to make it fast to set up for a new language or domain, thus contributing to faster and easier development of new parallel resources. We train several detoxification models on the collected data and compare them with several baselines and state-of-the-art unsupervised approaches. We conduct both automatic and manual evaluations. All models trained on parallel data outperform the state-of-the-art unsupervised models by a large margin. This suggests that our novel datasets can boost the performance of detoxification systems.
Exploring Cross-lingual Text Detoxification with Large Multilingual Language Models.
Daniil Moskovskiy | Daryna Dementieva | Alexander Panchenko
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
Detoxification is the task of generating text in a polite style while preserving the meaning and fluency of the original toxic text. Existing detoxification methods are monolingual, i.e., designed to work in a single language. This work investigates multilingual and cross-lingual detoxification and the behavior of large multilingual models in this setting. Unlike previous work, we aim to make large language models able to perform detoxification without direct fine-tuning in a given language. Experiments show that multilingual models are capable of performing multilingual style transfer. However, the tested state-of-the-art models are not able to perform cross-lingual detoxification, so direct fine-tuning in the target language is currently unavoidable, which motivates further research in this direction.
Co-authors
- Daryna Dementieva 2
- Alexander Panchenko 2
- Varvara Logacheva 1
- Sergey Ustyantsev 1
- David Dale 1
Venues
- ACL 2