Wenjie Hua

2025

The Historian’s Fingerprint: A Computational Stylometric Study of the Zuo Commentary and Discourses of the States
Wenjie Hua
Proceedings of the Second Workshop on Ancient Language Processing

Previous studies suggest that authorship can be inferred through stylistic features like function word usage and grammatical patterns, yet such analyses remain limited for Old Chinese texts with disputed authorship. Computational methods enable a more nuanced exploration of these texts. This study applies stylometric analysis to examine the authorship controversy between the Zuo Commentary and the Discourses of the States. Using PoS 4-grams, Kullback-Leibler divergence, and multidimensional scaling (MDS), we systematically compare their stylistic profiles. Results show that the Zuo Commentary exhibits high internal consistency, especially in the later eight Dukes chapters, supporting its integration by a single scholarly tradition. In contrast, the Discourses of the States displays greater stylistic diversity, aligning with the multiple-source compilation theory. Further analysis reveals partial stylistic similarities among the Lu, Jin, and Chu-related chapters, suggesting shared influences. These findings provide quantitative support for Tong Shuye’s arguments and extend statistical validation of Bernhard Karlgren’s assertion on the textual unity of the Zuo Commentary.
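
The comparison described in this abstract can be illustrated with a minimal sketch: count PoS 4-grams per chapter, then measure how far apart two chapters' 4-gram distributions are with Kullback-Leibler divergence. This is not the paper's implementation. The function names, the add-alpha smoothing, the symmetrised distance, and the toy tag sequences are all assumptions made for illustration. The resulting distance matrix is the kind of input an MDS routine would take, but the MDS step itself is omitted here.

```python
import math
from collections import Counter


def pos_ngrams(tags, n=4):
    """Count PoS n-grams (as tuples) in one chapter's tag sequence."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


def kl_divergence(p_counts, q_counts, alpha=1.0):
    """KL(P || Q) over the union vocabulary.

    Add-alpha smoothing is an assumption of this sketch; it keeps every
    probability non-zero so the logarithm is always defined.
    """
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + alpha * len(vocab)
    q_total = sum(q_counts.values()) + alpha * len(vocab)
    kl = 0.0
    for gram in vocab:
        p = (p_counts[gram] + alpha) / p_total
        q = (q_counts[gram] + alpha) / q_total
        kl += p * math.log(p / q)
    return kl


def distance_matrix(chapters, n=4):
    """Pairwise symmetrised KL between chapters (symmetrising is an assumption)."""
    names = sorted(chapters)
    profiles = {name: pos_ngrams(chapters[name], n) for name in names}
    matrix = []
    for a in names:
        row = []
        for b in names:
            if a == b:
                row.append(0.0)
            else:
                forward = kl_divergence(profiles[a], profiles[b])
                backward = kl_divergence(profiles[b], profiles[a])
                row.append(0.5 * (forward + backward))
        matrix.append(row)
    return names, matrix


# Hypothetical toy tag sequences; real input would be PoS-tagged chapters.
toy_chapters = {
    "chapter_a": ["n", "v", "n", "p", "n", "v", "n", "p"],
    "chapter_b": ["v", "v", "n", "n", "p", "p", "v", "n"],
}
names, matrix = distance_matrix(toy_chapters)
```

Chapters whose rows in the matrix are close to one another would cluster together in a low-dimensional MDS projection, which is how internal consistency or diversity could be read off visually.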

When Less Is More: Logits-Constrained Framework with RoBERTa for Ancient Chinese NER
Wenjie Hua | Shenghan Xu
Proceedings of the Second Workshop on Ancient Language Processing

This report presents our team’s work on ancient Chinese Named Entity Recognition (NER) for EvaHan 2025. We propose a two-stage framework combining GujiRoBERTa with a Logits-Constrained (LC) mechanism. The first stage generates contextual embeddings using GujiRoBERTa, followed by dynamically masked decoding to enforce valid BMES transitions. Experiments on EvaHan 2025 datasets demonstrate the framework’s effectiveness. Key findings include the LC framework’s superiority over CRFs in high-label scenarios and the detrimental effect of BiLSTM modules. We also establish empirical model selection guidelines based on label complexity and dataset size.
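
The idea of masked decoding over BMES labels can be sketched as follows: at each position, the scores of labels that would form an invalid transition from the previously chosen label are set to negative infinity before the best label is picked. This is a greedy illustration under assumptions, not the authors' LC mechanism. The label format (`B-PER`, `M-PER`, `E-PER`, `S-PER`, `O`), the presence of an `O` label, the function names, and the toy logits are all hypothetical.

```python
NEG_INF = float("-inf")


def split_label(label):
    """Split 'B-PER' into ('B', 'PER'); the outside label 'O' has no type."""
    if label == "O":
        return "O", None
    prefix, entity_type = label.split("-", 1)
    return prefix, entity_type


def is_valid(prev, cur, is_last):
    """Return True if label `cur` may follow `prev` under BMES rules."""
    cur_prefix, cur_type = split_label(cur)
    # An entity cannot be left open at the end of the sequence.
    if is_last and cur_prefix in ("B", "M"):
        return False
    if prev is not None:
        prev_prefix, prev_type = split_label(prev)
        # Inside an entity: it must continue (M) or close (E) with the same type.
        if prev_prefix in ("B", "M"):
            return cur_prefix in ("M", "E") and cur_type == prev_type
    # At the start, or after E / S / O: only a new entity or O may follow.
    return cur_prefix in ("B", "S", "O")


def constrained_decode(logits, labels):
    """Greedy decoding with invalid transitions masked out at each step."""
    path = []
    prev = None
    for t, row in enumerate(logits):
        is_last = t == len(logits) - 1
        masked = [
            score if is_valid(prev, label, is_last) else NEG_INF
            for score, label in zip(row, labels)
        ]
        best = max(range(len(labels)), key=masked.__getitem__)
        prev = labels[best]
        path.append(prev)
    return path


# Hypothetical label set and per-token scores for a three-token input.
labels = ["O", "B-PER", "M-PER", "E-PER", "S-PER"]
logits = [
    [0.1, 2.0, 3.0, 0.0, 0.5],  # unmasked argmax would be the invalid M-PER
    [2.5, 0.0, 1.0, 0.2, 0.0],  # unmasked argmax would be O, invalid after B-PER
    [0.0, 3.0, 1.0, 0.5, 0.0],  # last token: the open entity must close with E-PER
]
decoded = constrained_decode(logits, labels)
```

Unlike a CRF, a mask of this kind adds no learned transition parameters; it only removes label choices that the tagging scheme forbids.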

Co-authors

Shenghan Xu (1)

Venues

alp (2)
ws (2)
