Clara Boesenberg

2026

Trainable, Multiword-aware Linguistic Tokenization Using Modern Neural Networks
Clara Boesenberg | Kilian Evang
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

We revisit MWE-aware linguistic tokenization as a character-level and token-level sequence labeling problem and present a systematic evaluation on English, German, Italian, and Dutch data. We compare a standard tokenizer trained without MWE-awareness as a baseline (UDPipe), a character-level SRN+CRF model (Elephant), and transformer-based MaChAmp models trained either directly on gold character labels or as token-level postprocessors on top of UDPipe. Our results show that the two-stage pipeline – UDPipe pretokenization followed by MaChAmp postprocessing – consistently yields the best accuracy. Our analysis of error patterns highlights how different architectures trade off over- and undersegmentation. These findings provide practical guidance for building MWE-aware tokenizers and suggest that postprocessing pipelines with transformers are a strong and general strategy for non-standard tokenization.

Co-authors

Kilian Evang 1

Venues

EACL 1
