Comprehensive Evaluation on Lexical Normalization: Boundary-Aware Approaches for Unsegmented Languages

Shohei Higashiyama, Masao Utiyama


Abstract
Lexical normalization research has sought to tackle the challenge of processing informal expressions in user-generated text, yet the absence of comprehensive evaluations leaves it unclear which methods excel across multiple perspectives. Focusing on unsegmented languages, we make three key contributions: (1) creating a large-scale, multi-domain Japanese normalization dataset, (2) developing normalization methods based on state-of-the-art pre-trained models, and (3) conducting experiments across multiple evaluation perspectives. Our experiments show that both encoder-only and decoder-only approaches achieve promising results in both accuracy and efficiency.
Anthology ID:
2025.findings-emnlp.684
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
12774–12799
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.684/
DOI:
10.18653/v1/2025.findings-emnlp.684
Cite (ACL):
Shohei Higashiyama and Masao Utiyama. 2025. Comprehensive Evaluation on Lexical Normalization: Boundary-Aware Approaches for Unsegmented Languages. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 12774–12799, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Comprehensive Evaluation on Lexical Normalization: Boundary-Aware Approaches for Unsegmented Languages (Higashiyama & Utiyama, Findings 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.684.pdf
Checklist:
2025.findings-emnlp.684.checklist.pdf