ChengyuSTS: An Intrinsic Perspective on Mandarin Idiom Representation

Le Qiu, Emmanuele Chersoni, Aline Villavicencio


Abstract
Chengyu, or four-character idioms, are ubiquitous in both spoken and written Chinese. Despite their importance, chengyu remain underexplored in NLP tasks, and existing evaluation frameworks rely primarily on extrinsic measures. In this paper, we introduce an intrinsic evaluation task for Chinese idiomatic understanding: idiomatic semantic textual similarity (iSTS), which evaluates how well models capture the semantic similarity of sentences containing idioms. To this end, we present a curated dataset, ChengyuSTS. Our experiments show that current pre-trained sentence Transformer models generally fail to capture the idiomaticity of chengyu in a zero-shot setting. We then fine-tune models with the SimCSE contrastive learning framework and obtain promising results on idiomatic expressions. We also report the results of DeepSeek for reference.
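
As an illustration of what an intrinsic STS-style evaluation typically involves, the sketch below scores sentence pairs by the cosine similarity of their embeddings and compares the resulting ranking with gold similarity judgments using Spearman correlation. This is a minimal sketch under stated assumptions, not the authors' evaluation code: the function names, the toy two-dimensional vectors, and the toy gold scores are all hypothetical, and the abstract does not specify the metric or the rating scale used for ChengyuSTS.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate_sts(embedding_pairs, gold_scores):
    """Correlate predicted cosine similarities with gold similarity scores."""
    predicted = [cosine(u, v) for u, v in embedding_pairs]
    return spearman(predicted, gold_scores)


# Hypothetical toy data: in practice each vector would come from a sentence
# encoder applied to a sentence that contains (or paraphrases) a chengyu.
toy_pairs = [
    ([1.0, 0.0], [1.0, 0.1]),  # near-identical meaning
    ([1.0, 0.0], [0.5, 0.5]),  # loosely related
    ([1.0, 0.0], [0.0, 1.0]),  # unrelated
]
toy_gold = [5.0, 3.0, 0.0]  # assumed scale, for illustration only

score = evaluate_sts(toy_pairs, toy_gold)
print(f"Spearman correlation: {score:.3f}")
```

On the toy data the predicted ranking agrees with the gold ranking, so the correlation is close to 1.0. A model that encodes only the literal, character-level meaning of an idiom would be expected to rank such pairs differently from human judges and score lower.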
Anthology ID:
2025.starsem-1.1
Volume:
Proceedings of the 14th Joint Conference on Lexical and Computational Semantics (*SEM 2025)
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Lea Frermann, Mark Stevenson
Venue:
*SEM
Publisher:
Association for Computational Linguistics
Pages:
1–12
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.starsem-1.1/
Cite (ACL):
Le Qiu, Emmanuele Chersoni, and Aline Villavicencio. 2025. ChengyuSTS: An Intrinsic Perspective on Mandarin Idiom Representation. In Proceedings of the 14th Joint Conference on Lexical and Computational Semantics (*SEM 2025), pages 1–12, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
ChengyuSTS: An Intrinsic Perspective on Mandarin Idiom Representation (Qiu et al., *SEM 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.starsem-1.1.pdf