Xinyue Ma
2026
Bidirectional Chinese and English Passive Sentences Dataset for Machine Translation
Xinyue Ma | Pol Pastells | Mireia Farrus | Mariona Taule
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Machine Translation (MT) evaluation has gone beyond metrics, towards more specific linguistic phenomena. In the English-Chinese language pair, passive sentences are constructed and distributed differently in the two languages and thus need special attention in MT. This paper proposes a bidirectional multi-domain dataset of passive sentences, extracted from five Chinese-English parallel corpora and annotated automatically with structure labels according to the human translation, together with a test set with manually verified annotations. The dataset consists of 73,965 parallel sentence pairs (2,358,731 English words, 3,498,229 Chinese characters). We evaluate two state-of-the-art open-source MT systems with our dataset, and four commercial models with the test set. The results show that, unlike humans, models are influenced more by the voice of the source text than by the general voice usage of the source language, and therefore tend to maintain the passive voice when translating a passive in either direction. However, models demonstrate some knowledge of the low frequency and predominantly negative context of Chinese passives, leading to higher voice consistency with human translators in English-to-Chinese translation than in Chinese-to-English translation. Commercial NMT models scored higher in metric evaluations, but LLMs showed a better ability to use diverse alternative translations. The datasets and annotation script will be shared upon request.
2025
Semantic Prosody in Machine Translation: the English-Chinese Case of Passive Structures
Xinyue Ma | Pol Pastells | Mariona Taulé Delor | Mireia Farrús
Proceedings of the 14th Joint Conference on Lexical and Computational Semantics (*SEM 2025)
Semantic prosody is a collocational meaning formed through the co-occurrence of a linguistic unit and a consistent series of collocates, which should be treated separately from semantic meaning. Since words that are literal translations of each other may have different semantic prosody, more attention should be paid to this linguistic property in order to generate accurate translations. However, current machine translation models cannot handle this problem. To bridge the gap, we propose an approach to teach machine translation models about the semantic prosody of a specific structure. We focus on Chinese BEI passives and create a dataset of English-Chinese sentence pairs with the purpose of demonstrating the negative semantic prosody of BEI passives. We then fine-tune the OPUS-MT, NLLB-600M and mBART50-mmt models with our dataset for the English-Chinese translation task. Our results show that the fine-tuned MT models are better at using BEI passives to translate unfavourable content and at avoiding them for neutral and favourable content. Moreover, in NLLB-600M, which is a multilingual model, this knowledge of semantic prosody can be transferred from English-Chinese translation to other language pairs, such as Spanish-Chinese.