Shumin Shi
Also published as: 树敏 史
2026
Would LLMs be Good Historical Linguists and Chinese Dialect Learners?
Yicheng Liu | Shumin Shi | Youchao Zhou | Xingchen Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Yicheng Liu | Shumin Shi | Youchao Zhou | Xingchen Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) perform well on Standard Chinese but struggle with low-resource Chinese dialects due to substantial phonological divergence. We investigate whether incorporating Middle Chinese, the common historical ancestor of most of the modern Chinese dialects, can improve dialectal pronunciation modeling in a linguistically interpretable manner. We focus on two specific task variants: (1) conditional sound change rule induction (a variant of Sound Law Induction, SLI), where models infer executable phonological transformation rules from Middle Chinese to modern dialects, and (2) sentence-level dialectal pronunciation transcription (a variant of Grapheme-to-Phoneme, G2P), requiring dialect-specific International Phonetic Alphabet (IPA) generation. We construct a multi-source dataset covering Middle Chinese and 12 modern Chinese dialects, including character-level correspondences, rule exemplars, and sentence-level IPA transcription. We adopt a parameter-efficient training framework combining LoRA-based supervised fine-tuning and reinforcement learning via Group Relative Policy Optimization (GRPO) for the first task. Across both tasks and a wide range of dialects and evaluation metrics, our approach achieves overall improvements over strong baselines, including DeepSeek-V3.2 and ChatGPT-5.2, while revealing variation across dialects. These results demonstrate the value of leveraging historical linguistic knowledge for modeling low-resource Chinese dialects.
2023
CCL23-Eval 任务9系统报告:基于重叠片段生成增强阅读理解模型鲁棒性的方法(System Report for CCL23-Eval Task 9: Improving MRC Robustness with Overlapping Segments Generation for GCRC_advRobust)
Suzhe He (何苏哲) | Chongsheng Yang (杨崇盛) | Shumin Shi (史树敏)
Proceedings of the 22nd Chinese National Conference on Computational Linguistics (Volume 3: Evaluations)
Suzhe He (何苏哲) | Chongsheng Yang (杨崇盛) | Shumin Shi (史树敏)
Proceedings of the 22nd Chinese National Conference on Computational Linguistics (Volume 3: Evaluations)
“目前机器阅读理解在抽取语义完整的选项证据时存在诸多挑战。现有通过无监督方式进行证据抽取的工作主要分为两类,一是利用静态词向量,采用集束搜索迭代地提取相关句子;另一类是使用实例级监督方法,包括独立式证据抽取和端到端式证据抽取。前者处理流程上较为繁琐,后者在联合训练时存在不稳定性,直接导致模型性能难以稳定提升。在CCL23-Eval 任务9中,本文提出了一种基于重叠片段生成的自适应端到端证据抽取方法。该方法针对证据句边界不明确的问题,通过将文档划分为多个重叠的句子片段,并提取关键部分作为证据来实现整体语义的抽取。同时,将证据提取嵌入模块予以优化,实现了证据片段置信度自动调整。实验结果表明本文所提出方法能够极大地排除冗余内容干扰,仅需一个超参数即可稳定提升阅读理解模型性能,增强了模型鲁棒性。”
2017
A Parallel Recurrent Neural Network for Language Modeling with POS Tags
Chao Su | Heyan Huang | Shumin Shi | Yuhang Guo | Hao Wu
Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation
Chao Su | Heyan Huang | Shumin Shi | Yuhang Guo | Hao Wu
Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation
QLUT at SemEval-2017 Task 2: Word Similarity Based on Word Embedding and Knowledge Base
Fanqing Meng | Wenpeng Lu | Yuteng Zhang | Ping Jian | Shumin Shi | Heyan Huang
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)
Fanqing Meng | Wenpeng Lu | Yuteng Zhang | Ping Jian | Shumin Shi | Heyan Huang
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)
This paper shows the details of our system submissions in the task 2 of SemEval 2017. We take part in the subtask 1 of this task, which is an English monolingual subtask. This task is designed to evaluate the semantic word similarity of two linguistic items. The results of runs are assessed by standard Pearson and Spearman correlation, contrast with official gold standard set. The best performance of our runs is 0.781 (Final). The techniques of our runs mainly make use of the word embeddings and the knowledge-based method. The results demonstrate that the combined method is effective for the computation of word similarity, while the word embeddings and the knowledge-based technique, respectively, needs more deeply improvement in details.