Abstract
This paper revisits tokenization from a theoretical perspective, and argues for the necessity of a constructivist approach to tokenization for semantic parsing and modeling language acquisition. We consider two problems: (1) (semi-)automatically converting existing lexicalist annotations, e.g. those of the Penn TreeBank, into constructivist annotations, and (2) automatic tokenization of raw texts. We demonstrate that (1) a heuristic rule-based constructivist tokenizer is able to yield relatively satisfactory accuracy when gold standard Penn TreeBank part-of-speech tags are available, but that some manual annotations are still necessary to obtain gold standard results, and (2) a neural tokenizer is able to provide accurate automatic constructivist tokenization results from raw character sequences. Our research output also includes a set of high-quality morpheme-tokenized corpora, which enable the training of computational models that more closely align with language comprehension and acquisition.

- Anthology ID: 2023.cxgsnlp-1.5
- Volume: Proceedings of the First International Workshop on Construction Grammars and NLP (CxGs+NLP, GURT/SyntaxFest 2023)
- Month: March
- Year: 2023
- Address: Washington, D.C.
- Editors: Claire Bonial, Harish Tayyar Madabushi
- Venues: CxGsNLP | SyntaxFest
- SIG: SIGPARSE
- Publisher: Association for Computational Linguistics
- Pages: 36–40
- URL: https://aclanthology.org/2023.cxgsnlp-1.5
- Cite (ACL): Allison Fan and Weiwei Sun. 2023. Constructivist Tokenization for English. In Proceedings of the First International Workshop on Construction Grammars and NLP (CxGs+NLP, GURT/SyntaxFest 2023), pages 36–40, Washington, D.C. Association for Computational Linguistics.
- Cite (Informal): Constructivist Tokenization for English (Fan & Sun, CxGsNLP-SyntaxFest 2023)
- PDF: https://preview.aclanthology.org/emnlp-22-attachments/2023.cxgsnlp-1.5.pdf