Don’t Want Your LLM to Recommend Nuclear Strike? Try Asking It in Japanese

Rian Touchent

Abstract

Large language models are increasingly used in strategic and advisory contexts, yet their safety alignment is typically evaluated in English only. We test nine models from six providers and ask whether the language of a prompt can change a model’s decision in a high-stakes scenario. We use single-turn game-theoretic vignettes in which a model advises a nuclear-armed nation on whether to strike a defenseless opponent. The prompt is intentionally amoral and strategically identical across languages. We find that Japanese prompts reduce launch rates in the Claude model family: Claude Sonnet 4.6 drops from 40% to 0% in scenarios where the strike is unnecessary and from 93% to 17% in contested scenarios, with minimal effect when the strike is strategically rational. The effect extends to Gemini Pro 3.1 (53% to 13%). A cross-language experiment isolates the mechanism: when instructed to reason in Japanese in an English prompt, launch rates drop from 93% to 37%. It is the language the model is asked to reason in, not the language of the input, that drives the effect. When reasoning in Japanese, models spontaneously generate moral vocabulary ("moral cost", "millions of lives") that is entirely absent from the prompt. Five other models show no language effect, but they launch in nearly every condition regardless of language. The effect requires a model that already hesitates in English. These results show that LLM safety behavior is language-dependent, and that evaluating in English alone can miss both risks and safeguards encoded in other languages.
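The size of the reported effect can be read off with simple arithmetic. The sketch below is only an illustration and is not the author's analysis code: it tabulates the launch rates quoted in the abstract and computes each drop in percentage points. The dictionary labels and the function names `drop_in_points` and `summarize` are hypothetical, and the abstract does not say which scenario the Gemini Pro 3.1 figures come from.

```python
# Launch rates in percent, as quoted in the abstract: (baseline, Japanese condition).
# Labels are illustrative; only the numbers come from the abstract.
REPORTED = {
    "Claude Sonnet 4.6, strike unnecessary": (40, 0),
    "Claude Sonnet 4.6, contested": (93, 17),
    "Gemini Pro 3.1 (scenario not stated in abstract)": (53, 13),
    "English prompt, reasoning in Japanese": (93, 37),
}


def drop_in_points(baseline: int, japanese: int) -> int:
    """Absolute reduction in launch rate, in percentage points."""
    return baseline - japanese


def summarize(reported: dict[str, tuple[int, int]]) -> dict[str, int]:
    """Map each condition label to its drop in percentage points."""
    return {
        label: drop_in_points(baseline, japanese)
        for label, (baseline, japanese) in reported.items()
    }


if __name__ == "__main__":
    for label, drop in summarize(REPORTED).items():
        print(f"{label}: {drop} percentage points lower")
```

Reporting the difference in percentage points rather than as a ratio keeps the 40% to 0% case well defined and makes the conditions directly comparable.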

Anthology ID: 2026.trustnlp-main.35
Volume: Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026)
Month: July
Year: 2026
Address: San Diego, California
Editors: Kai-Wei Chang, Ninareh Mehrabi, Satyapriya Krishna, Anubrata Das, Jwala Dhamala, Yang Trista Cao, Tharindu Kumarage, Anil Ramakrishna, Christos Christodoulopoulos, Yixin Wan, Aram Galystan, Anoop Kumar, Rahul Gupta
Venues: TrustNLP | WS
Publisher: Association for Computational Linguistics
Pages: 489–502
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.trustnlp-main.35/
Cite (ACL): Rian Touchent. 2026. Don’t Want Your LLM to Recommend Nuclear Strike? Try Asking It in Japanese. In Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026), pages 489–502, San Diego, California. Association for Computational Linguistics.
Cite (Informal): Don’t Want Your LLM to Recommend Nuclear Strike? Try Asking It in Japanese (Touchent, TrustNLP 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.trustnlp-main.35.pdf
