Abstract
This paper presents an algorithm and implementation for efficient tokenization of space-delimited languages based on a deterministic finite state automaton. Two representations of the underlying data structure are presented and a model implementation for German is compared with state-of-the-art approaches. The presented solution is faster than other tools while maintaining comparable quality.
- Anthology ID: 2022.cmlc-1.4
- Volume: Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10)
- Month: June
- Year: 2022
- Address: Marseille, France
- Editors: Piotr Banski, Adrien Barbaresi, Simon Clematide, Marc Kupietz, Harald Lüngen
- Venue: CMLC
- Publisher: European Language Resources Association
- Pages: 20–26
- URL: https://aclanthology.org/2022.cmlc-1.4
- Cite (ACL): Nils Diewald. 2022. Matrix and Double-Array Representations for Efficient Finite State Tokenization. In Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10), pages 20–26, Marseille, France. European Language Resources Association.
- Cite (Informal): Matrix and Double-Array Representations for Efficient Finite State Tokenization (Diewald, CMLC 2022)
- PDF: https://preview.aclanthology.org/emnlp-22-attachments/2022.cmlc-1.4.pdf