Huacheng Song


2025

pdf bib
Reasoning or Memorization? Investigating LLMs’ Capability in Restoring Chinese Internet Homophones
Jianfei Ma | Zhaoxin Feng | Huacheng Song | Emmanuele Chersoni | Zheng Chen
Proceedings of the 3rd Workshop on Towards Knowledgeable Foundation Models (KnowFM)

Chinese homophones, prevalent in Internet culture, bring rich linguistic twists that are challenging for language models. While native speakers disambiguate them through phonological reasoning and contextual understanding, it remains untested how well LLMs perform on this task and whether LLMs also achieve this via similar reasoning processes or merely through memorization of homophone-original word pairs during training.In this paper, we present HomoP-CN, the first Chinese Internet homophones dataset with systematic perturbations for evaluating LLMs’ homophone restoration capabilities. Using this benchmark, we investigated the influence of semantic, phonological, and graphemic features on LLMs’ restoration accuracy, measured the reliance levels of each model on memorization during restoration through consistency ratios under controlled perturbations, and assessed the effectiveness of various prompting strategies, including contextual cues, pinyin augmentation, few-shot learning, and thought-chain approaches.

2024

pdf bib
A Deep Analysis of the Impact of Multiword Expressions and Named Entities on Chinese-English Machine Translations
Huacheng Song | Hongzhi Xu
Findings of the Association for Computational Linguistics: EMNLP 2024

In this paper, we present a study on the impact of so-called multiword expressions (MWEs) and multiword named entities (NEs) on the performance of Chinese-English machine translation (MT) systems. Built on an extended version of the data from the WMT22 Metrics Shared Task (with extra labels of 9 types of Chinese MWEs, and 19 types of Chinese multiword NEs) which includes scores and error annotations provided by human experts, we make further extraction of MWE- and NE-related translation errors. By investigating the human evaluation scores and the error rates on each category of MWEs and NEs, we find that: 1) MT systems tend to perform significantly worse on Chinese sentences with most kinds of MWEs and NEs; 2) MWEs and NEs which make up of about twenty percent of tokens, i.e. characters in Chinese, result in one-third of translation errors; 3) for 13 categories of MWEs and NEs, the error rates exceed 50% with the highest to be 84.8%. Based on the results, we emphasize that MWEs and NEs are still a bottleneck issue for MT and special attention to MWEs and NEs should be paid to further improving the performance of MT systems.

pdf bib
Benchmarking the Performance of Machine Translation Evaluation Metrics with Chinese Multiword Expressions
Huacheng Song | Hongzhi Xu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

To investigate the impact of Multiword Expressions (MWEs) on the fine-grained performance of the state-of-the-art metrics for Machine Translation Evaluation (MTE), we conduct experiments on the WMT22 Metrics Shared Task dataset with a preliminary focus on the Chinese-to-English language pair. We further annotate 28 types of Chinese MWEs on the source texts and then examine the performance of 31 MTE metrics on groups of sentences containing different MWEs. We have 3 interesting findings: 1) Machine Translation (MT) systems tend to perform worse on most Chinese MWE categories, confirming the previous claim that MWEs are a bottleneck of MT; 2) automatic metrics tend to overrate the translation of sentences containing MWEs; 3) most neural-network-based metrics perform better than string-overlap-based metrics. It concludes that both MT systems and MTE metrics still suffer from MWEs, suggesting richer annotation of data to facilitate MWE-aware automatic MTE and MT.

pdf bib
How Grammatical Features Impact Machine Translation: A New Test Suite for Chinese-English MT Evaluation
Huacheng Song | Yi Li | Yiwen Wu | Yu Liu | Jingxia Lin | Hongzhi Xu
Proceedings of the Ninth Conference on Machine Translation

Machine translation (MT) evaluation has evolved toward a trend of fine-grained granularity, enabling a more precise diagnosis of hidden flaws and weaknesses of MT systems from various perspectives. This paper examines how MT systems are potentially affected by certain grammatical features, offering insights into the challenges these features pose and suggesting possible directions for improvement. We develop a new test suite by extracting 7,848 sentences from a multi-domain Chinese-English parallel corpus. All the Chinese text was further annotated with 43 grammatical features using a semi-automatic method. This test suite was subsequently used to evaluate eight state-of-the-art MT systems according to six different automatic evaluation metrics. The results reveal intriguing patterns of MT performance associated with different domains and various grammatical features, highlighting the test suite’s effectiveness. The test suite was made publicly available and it will serve as an important benchmark for evaluating and diagnosing Chinese-English MT systems.