Multi-Modal Multi-Granularity Tokenizer for Chu Bamboo Slips
Yingfa Chen, Chenlong Hu, Cong Feng, Chenyang Song, Shi Yu, Xu Han, Zhiyuan Liu, Maosong Sun
Abstract
This study presents a multi-modal multi-granularity tokenizer specifically designed for analyzing ancient Chinese scripts, focusing on the Chu bamboo slip (CBS) script used during the Spring and Autumn and Warring States period (771-256 BCE) in Ancient China. Considering the complex hierarchical structure of ancient Chinese scripts, where a single character may be a combination of multiple sub-characters, our tokenizer first adopts character detection to locate character boundaries. Then it conducts character recognition at both the character and sub-character levels. Moreover, to support the academic community, we assembled the first large-scale dataset of CBSs with over 100K annotated character image scans. On the part-of-speech tagging task built on our dataset, using our tokenizer gives a 5.5% relative improvement in F1-score compared to mainstream sub-word tokenizers. Our work not only aids in further investigations of the specific script but also has the potential to advance research on other forms of ancient Chinese scripts.
- Anthology ID:
- 2025.coling-main.414
- Volume:
- Proceedings of the 31st International Conference on Computational Linguistics
- Month:
- January
- Year:
- 2025
- Address:
- Abu Dhabi, UAE
- Editors:
- Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
- Venue:
- COLING
- Publisher:
- Association for Computational Linguistics
- Pages:
- 6201–6211
- URL:
- https://preview.aclanthology.org/fix-sig-urls/2025.coling-main.414/
- Cite (ACL):
- Yingfa Chen, Chenlong Hu, Cong Feng, Chenyang Song, Shi Yu, Xu Han, Zhiyuan Liu, and Maosong Sun. 2025. Multi-Modal Multi-Granularity Tokenizer for Chu Bamboo Slips. In Proceedings of the 31st International Conference on Computational Linguistics, pages 6201–6211, Abu Dhabi, UAE. Association for Computational Linguistics.
- Cite (Informal):
- Multi-Modal Multi-Granularity Tokenizer for Chu Bamboo Slips (Chen et al., COLING 2025)
- PDF:
- https://preview.aclanthology.org/fix-sig-urls/2025.coling-main.414.pdf
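
The abstract describes a two-stage tokenizer: character detection to locate boundaries, followed by recognition at both the character and sub-character levels. The following is a minimal sketch of how the outputs of those two stages might be merged into a token sequence. It is not the authors' implementation. The `DetectedChar` type, the `tokenize` function, the `min_conf` threshold, the fallback rule (use sub-character tokens when the character-level prediction is missing or low-confidence), the top-to-bottom reading order, and the placeholder labels are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DetectedChar:
    """One detected character region on a slip image (hypothetical structure)."""
    box: Tuple[int, int, int, int]   # (x, y, width, height) from the detector
    char_label: Optional[str]        # character-level prediction, or None
    char_conf: float                 # confidence of the character-level prediction
    sub_labels: List[str]            # sub-character component predictions


def tokenize(chars: List[DetectedChar], min_conf: float = 0.5) -> List[str]:
    """Merge character-level and sub-character-level predictions into tokens.

    Assumed rule: keep the character-level label when it is confident enough,
    otherwise fall back to the sub-character components.
    """
    tokens: List[str] = []
    # Assumed reading order: top to bottom by the box's y coordinate.
    for c in sorted(chars, key=lambda c: c.box[1]):
        if c.char_label is not None and c.char_conf >= min_conf:
            tokens.append(c.char_label)
        else:
            tokens.extend(c.sub_labels)
    return tokens


# Placeholder labels; a real system would use glyph or component identifiers.
example = [
    DetectedChar((0, 40, 10, 10), None, 0.0, ["SUB_1", "SUB_2"]),
    DetectedChar((0, 0, 10, 10), "CHAR_A", 0.9, ["SUB_3"]),
]
tokens = tokenize(example)
```

In this sketch, a character the recognizer cannot label as a whole still contributes tokens through its components, which is one plausible way a multi-granularity vocabulary could cover characters that a character-only vocabulary would miss.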