Lian Chen


2026

This paper proposes a semi-automatic lexico-semantic modeling framework for Chinese chéngyǔ containing body-part and animal lexemes. The framework combines manual semantic annotation, lightweight RDF/OWL formalization, and semantic classification to investigate whether lexical mediators such as 心 xīn “heart/mind”, 口 kǒu “mouth”, or 马 mǎ “horse” are sufficient to predict idiomatic semantic interpretation. Using 440 annotated chéngyǔ normalized into 18 semantic categories, we compare three classification approaches: a rule-based keyword baseline, character n-gram TF-IDF with logistic regression, and BERT-base-chinese. The results show that lexical mediators cannot be directly equated with semantic categories. The TF-IDF model achieves the best overall performance of the three, suggesting that lightweight character-level representations remain robust for very short idioms in low-resource settings. The study contributes an interpretable RDF/OWL-compatible resource for culture-aware modeling of Chinese idioms.
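As a rough illustration of two of the ideas named above, the following minimal sketch shows a keyword baseline driven by lexical mediators and the extraction of character n-grams of the kind a TF-IDF model would weight. It is not the paper's implementation: the rule table `MEDIATOR_RULES`, its three category labels, and the function names are hypothetical and do not reproduce the 18 categories used in the study.

```python
def char_ngrams(text, n_min=1, n_max=2):
    """Return all character n-grams of length n_min..n_max, shortest first."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


# Hypothetical mediator-to-category rules (assumption, for illustration only).
MEDIATOR_RULES = {"心": "emotion", "口": "speech", "马": "speed"}


def keyword_baseline(idiom, default="other"):
    """Assign the category of the first mediator character found in the idiom."""
    for ch in idiom:
        if ch in MEDIATOR_RULES:
            return MEDIATOR_RULES[ch]
    return default


if __name__ == "__main__":
    # Both idioms contain 心, so the baseline gives them the same label,
    # although 心花怒放 expresses joy and 心狠手辣 expresses ruthlessness.
    print(keyword_baseline("心花怒放"), keyword_baseline("心狠手辣"))
    print(char_ngrams("马到成功"))
```

Under these assumed rules, 心花怒放 ("elated") and 心狠手辣 ("ruthless") receive the same label because both contain 心, which is the kind of mismatch between mediator and meaning that the abstract describes. The n-gram function, in contrast, exposes the surrounding characters (for example 马到 and 成功 in 马到成功) that a character-level classifier can draw on.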