NLPDove at SemEval-2020 Task 12: Improving Offensive Language Detection with Cross-lingual Transfer

Hwijeen Ahn; Jimin Sun; Chan Young Park; Jungyun Seo

doi:10.18653/v1/2020.semeval-1.206

Abstract

This paper describes our approach to the task of identifying offensive language in a multilingual setting. We investigate two data augmentation strategies: using additional semi-supervised labels with different thresholds, and cross-lingual transfer with data selection. Leveraging the semi-supervised dataset improved performance over the baseline trained solely on the manually annotated dataset. We propose a new metric, Translation Embedding Distance, to measure the transferability of instances for cross-lingual data selection. We also introduce various preprocessing steps tailored for social media text, along with methods to fine-tune the pre-trained multilingual BERT (mBERT) for offensive language identification. Our multilingual systems achieved competitive results in Greek, Danish, and Turkish at OffensEval 2020.
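
The first augmentation strategy, selecting semi-supervised labels by threshold, can be illustrated with a minimal sketch. This is not the authors' implementation: the `WeakExample` type, the `select_confident` helper, the `score` field, and the threshold values 0.2 and 0.8 are all hypothetical, chosen only to show the general idea of keeping weakly labeled instances whose score is confidently high or low. The `OFF` and `NOT` label names follow the OLID convention.

```python
from typing import List, NamedTuple, Tuple


class WeakExample(NamedTuple):
    """A weakly labeled instance (hypothetical structure for illustration)."""
    text: str
    score: float  # assumed: estimated probability that the text is offensive


def select_confident(
    examples: List[WeakExample],
    low: float = 0.2,   # assumed lower threshold, not taken from the paper
    high: float = 0.8,  # assumed upper threshold, not taken from the paper
) -> List[Tuple[str, str]]:
    """Keep examples with a confidently high or low score and assign a hard label."""
    selected = []
    for ex in examples:
        if ex.score >= high:
            selected.append((ex.text, "OFF"))
        elif ex.score <= low:
            selected.append((ex.text, "NOT"))
        # Scores strictly between the two thresholds are dropped as too uncertain.
    return selected


pool = [WeakExample("a", 0.95), WeakExample("b", 0.5), WeakExample("c", 0.05)]
augmented = select_confident(pool)
```

Moving the two thresholds closer together admits more, but noisier, training instances; moving them apart keeps fewer, cleaner ones. Trying several such settings is one way to read "different thresholds" in the abstract.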

Anthology ID: 2020.semeval-1.206
Volume: Proceedings of the Fourteenth Workshop on Semantic Evaluation
Month: December
Year: 2020
Address: Barcelona (online)
Editors: Aurelie Herbelot, Xiaodan Zhu, Alexis Palmer, Nathan Schneider, Jonathan May, Ekaterina Shutova
Venue: SemEval
SIG: SIGLEX
Publisher: International Committee for Computational Linguistics
Pages: 1576–1586
URL: https://preview.aclanthology.org/jlcl-multiple-ingestion/2020.semeval-1.206/
DOI: 10.18653/v1/2020.semeval-1.206
Cite (ACL): Hwijeen Ahn, Jimin Sun, Chan Young Park, and Jungyun Seo. 2020. NLPDove at SemEval-2020 Task 12: Improving Offensive Language Detection with Cross-lingual Transfer. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pages 1576–1586, Barcelona (online). International Committee for Computational Linguistics.
Cite (Informal): NLPDove at SemEval-2020 Task 12: Improving Offensive Language Detection with Cross-lingual Transfer (Ahn et al., SemEval 2020)
PDF: https://preview.aclanthology.org/jlcl-multiple-ingestion/2020.semeval-1.206.pdf
Code: hwijeen/OffensEval2020
Data: OLID
