CompLex-ZH: A New Dataset for Lexical Complexity Prediction in Mandarin and Cantonese
Le Qiu, Shanyue Guo, Tak-Sum Wong, Emmanuele Chersoni, John Lee, Chu-Ren Huang
Abstract
The prediction of lexical complexity in context is becoming increasingly relevant in Natural Language Processing research, since identifying complex words is often the first step of text simplification pipelines. To the best of our knowledge, though, datasets annotated with complex words are available only for English and for a limited number of Western languages. In our paper, we introduce CompLex-ZH, a dataset including words annotated with complexity scores in sentential contexts for Chinese. Our data include sentences in Mandarin and Cantonese, which were selected from a variety of sources and textual genres. We provide a first evaluation with baselines combining hand-crafted and language-model-based features.
- Anthology ID:
- 2024.tsar-1.3
- Volume:
- Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, USA
- Editors:
- Matthew Shardlow, Horacio Saggion, Fernando Alva-Manchego, Marcos Zampieri, Kai North, Sanja Štajner, Regina Stodden
- Venues:
- TSAR | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 20–26
- URL:
- https://preview.aclanthology.org/Author-page-Marten-During-lu/2024.tsar-1.3/
- DOI:
- 10.18653/v1/2024.tsar-1.3
- Cite (ACL):
- Le Qiu, Shanyue Guo, Tak-Sum Wong, Emmanuele Chersoni, John Lee, and Chu-Ren Huang. 2024. CompLex-ZH: A New Dataset for Lexical Complexity Prediction in Mandarin and Cantonese. In Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024), pages 20–26, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal):
- CompLex-ZH: A New Dataset for Lexical Complexity Prediction in Mandarin and Cantonese (Qiu et al., TSAR 2024)
- PDF:
- https://preview.aclanthology.org/Author-page-Marten-During-lu/2024.tsar-1.3.pdf