Zhifan Sun
2026
XToM: Exploring the Multilingual Theory of Mind for Large Language Models
Chunkit Chan | Yauwai Yim | Hongchuan Zeng | Zhiying Zou | Xinyuan Cheng | Zhifan Sun | Zheye Deng | Kawai Chung | Yuzhuo Ao | Fan Yixiang | Cheng Jiayang | Ercong Nie | Ginny Wong | Helmut Schmid | Hinrich Schuetze | Simon See | Yangqiu Song
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Chunkit Chan | Yauwai Yim | Hongchuan Zeng | Zhiying Zou | Xinyuan Cheng | Zhifan Sun | Zheye Deng | Kawai Chung | Yuzhuo Ao | Fan Yixiang | Cheng Jiayang | Ercong Nie | Ginny Wong | Helmut Schmid | Hinrich Schuetze | Simon See | Yangqiu Song
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Theory of Mind (ToM)—the ability to infer mental states in others—is pivotal for human social cognition. Existing evaluations of ToM in LLMs are largely limited to English, neglecting the linguistic diversity that shapes human cognition. This limitation raises a critical question: can LLMs exhibit Multilingual Theory of Mind—the capacity to reason about mental states across diverse linguistic contexts? To address this gap, we present XToM, a rigorously validated multilingual benchmark that evaluates ToM across five languages and incorporates diverse, contextually rich task scenarios. Using XToM, we systematically evaluate LLMs (e.g., DeepSeek R1), revealing a pronounced dissonance: while models excel in multilingual language understanding, their ToM performance varies across languages. Our findings expose limitations in LLMs’ ability to replicate human-like mentalizing across linguistic contexts.
2024
A Test Suite of Prompt Injection Attacks for LLM-based Machine Translation
Antonio Valerio Miceli Barone | Zhifan Sun
Proceedings of the Ninth Conference on Machine Translation
Antonio Valerio Miceli Barone | Zhifan Sun
Proceedings of the Ninth Conference on Machine Translation
LLM-based NLP systems typically work by embedding their input data into prompt templates which contain instructions and/or in-context examples, creating queries which are submitted to a LLM, then parse the LLM response in order to generate the system outputs. Prompt Injection Attacks (PIAs) are a type of subversion of these systems where a malicious user crafts special inputs which interfer with the prompt templates, causing the LLM to respond in ways unintended by the system designer.Recently, Sun and Miceli-Barone (2024) proposed a class of PIAs against LLM-based machine translation. Specifically, the task is to translate questions from the TruthfulQA test suite, where an adversarial prompt is prepended to the questions, instructing the system to ignore the translation instruction and answer the questions instead.In this test suite we extend this approach to all the language pairs of the WMT 2024 General Machine Translation task. Moreover, we include additional attack formats in addition to the one originally studied.
Scaling Behavior of Machine Translation with Large Language Models under Prompt Injection Attacks
Zhifan Sun | Antonio Valerio Miceli-Barone
Proceedings of the First edition of the Workshop on the Scaling Behavior of Large Language Models (SCALE-LLM 2024)
Zhifan Sun | Antonio Valerio Miceli-Barone
Proceedings of the First edition of the Workshop on the Scaling Behavior of Large Language Models (SCALE-LLM 2024)
Large Language Models (LLMs) are increasingly becoming the preferred foundation platforms for many Natural Language Processing tasks such as Machine Translation, owing to their quality often comparable to or better than task-specific models, and the simplicity of specifying the task through natural language instructions or in-context examples.Their generality, however, opens them up to subversion by end users who may embed into their requests instructions that cause the model to behave in unauthorized and possibly unsafe ways.In this work we study these Prompt Injection Attacks (PIAs) on multiple families of LLMs on a Machine Translation task, focusing on the effects of model size on the attack success rates.We introduce a new benchmark data set and we discover that on multiple language pairs and injected prompts written in English, larger models under certain conditions may become more susceptible to successful attacks, an instance of the Inverse Scaling phenomenon (McKenzie et al., 2023).To our knowledge, this is the first work to study non-trivial LLM scaling behaviour in a multi-lingual setting.