Abstract
We present the winning entry to the Multilingual Lexical Normalization (MultiLexNorm) shared task at W-NUT 2021 (van der Goot et al., 2021a), which evaluates lexical-normalization systems on 12 social media datasets in 11 languages. We base our solution on a pre-trained byte-level language model, ByT5 (Xue et al., 2021a), which we further pre-train on synthetic data and then fine-tune on authentic normalization data. Our system achieves the best performance by a wide margin in intrinsic evaluation, and also the best performance in extrinsic evaluation through dependency parsing. The source code is released at https://github.com/ufal/multilexnorm2021 and the fine-tuned models at https://huggingface.co/ufal.
- Anthology ID:
- 2021.wnut-1.54
- Volume:
- Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)
- Month:
- November
- Year:
- 2021
- Address:
- Online
- Venue:
- WNUT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 483–492
- URL:
- https://aclanthology.org/2021.wnut-1.54
- DOI:
- 10.18653/v1/2021.wnut-1.54
- Cite (ACL):
- David Samuel and Milan Straka. 2021. ÚFAL at MultiLexNorm 2021: Improving Multilingual Lexical Normalization by Fine-tuning ByT5. In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021), pages 483–492, Online. Association for Computational Linguistics.
- Cite (Informal):
- ÚFAL at MultiLexNorm 2021: Improving Multilingual Lexical Normalization by Fine-tuning ByT5 (Samuel & Straka, WNUT 2021)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2021.wnut-1.54.pdf
- Code:
- ufal/multilexnorm2021