@inproceedings{sun-etal-2025-beyond,
    title = "Beyond Reactive Safety: Risk-Aware {LLM} Alignment via Long-Horizon Simulation",
    author = "Sun, Chenkai  and
      Zhang, Denghui  and
      Zhai, ChengXiang  and
      Ji, Heng",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/ingest-emnlp/2025.findings-acl.332/",
    doi = "10.18653/v1/2025.findings-acl.332",
    pages = "6422--6434",
    ISBN = "979-8-89176-256-5",
    abstract = "Given the growing influence of language model-based agents on high-stakes societal decisions, from public policy to healthcare, ensuring their beneficial impact requires understanding the far-reaching implications of their suggestions. We propose a proof-of-concept framework that projects how model-generated advice could propagate through societal systems on a macroscopic scale over time, enabling more robust alignment. To assess the long-term safety awareness of language models, we also introduce a dataset of 100 indirect harm scenarios, testing models' ability to foresee adverse, non-obvious outcomes from seemingly harmless user prompts. Our approach achieves not only over 20{\%} improvement on the new dataset but also an average win rate exceeding 70{\%} against strong baselines on existing safety benchmarks (AdvBench, SafeRLHF, WildGuardMix), suggesting a promising direction for safer agents."
}