Ying Nian Wu
2026
Gold-Medal-Level Olympiad Geometry Solving with Efficient Heuristic Auxiliary Constructions
Boyan Duan | Xiao Liang | Shuai Lu | Yaoxiang Wang | Yelong Shen | Kai-Wei Chang | Ying Nian Wu | Mao Yang | Weizhu Chen | Yeyun Gong
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Automated theorem proving in Euclidean geometry, particularly for International Mathematical Olympiad (IMO) level problems, remains a major challenge and an important research focus in Artificial Intelligence. In this paper, we present a highly efficient method for geometry theorem proving that runs entirely on CPUs without relying on neural network–based inference. Our initial study shows that a simple random strategy for adding auxiliary points can achieve “silver-medal” level human performance on IMO. Building on this, we propose HAGeo, a Heuristic-based method for adding Auxiliary points in Geometric deduction that solves 28 of 30 problems on the IMO-30 benchmark, achieving “gold-medal” level performance and surpassing AlphaGeometry, a competitive neural network–based approach, by a notable margin. To evaluate our method and existing approaches more comprehensively, we further construct HAGeo, a benchmark consisting of 409 geometry problems with human-assessed difficulty levels. Compared with the widely used IMO-30, our benchmark poses greater challenges and provides a more precise evaluation, setting a higher bar for geometry theorem proving.
Dynamic Generation of Multi LLM Agents Communication Topologies with Graph Diffusion Models
Eric Hanchen Jiang | Levina Li | Frank Wan | Xiao Liang | Sophia Yin | Yuchen Wu | Xinfeng Li | Yizhou Sun | Wei Wang | Kai-Wei Chang | Ying Nian Wu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The efficiency of multi-agent systems driven by large language models (LLMs) largely hinges on their communication topology. However, designing an optimal topology is a non-trivial challenge, as it requires balancing competing objectives such as task performance, communication cost, and robustness. Existing frameworks often rely on static or hand-crafted topologies, which inherently fail to adapt to diverse task requirements, leading to either excessive token consumption for simple problems or performance bottlenecks for complex ones. To address this challenge, we introduce a novel generative framework called Guided Topology Diffusion (GTD). Inspired by conditional discrete graph diffusion models, GTD formulates topology synthesis as an iterative construction process. At each step, the generation is steered by a lightweight proxy model that predicts multi-objective rewards (e.g., accuracy, utility, cost), enabling real-time, gradient-free optimization towards task-adaptive topologies. This iterative, guided synthesis process distinguishes GTD from single-step generative frameworks, enabling it to better navigate complex design trade-offs. We validated GTD across multiple benchmarks, and experiments show that this framework can generate highly task-adaptive, sparse, and efficient communication topologies, significantly outperforming existing methods in LLM agent collaboration. Our code is available at https://anonymous.4open.science/r/diffusion_agent-953C.
Mitigating Over-Refusal in Aligned Large Language Models via Inference-Time Activation Energy
Eric Hanchen Jiang | Weixuan Ou | Run Liu | Shengyuan Pang | Guancheng Wan | Ranjie Duan | Wei Dong | Kai-Wei Chang | XiaoFeng Wang | Ying Nian Wu | Xinfeng Li
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Safety alignment of large language models currently faces a central challenge: existing alignment techniques often prioritize mitigating responses to harmful prompts at the expense of overcautious behavior, leading models to incorrectly refuse benign requests. A key goal of safe alignment is therefore to improve safety while simultaneously minimizing false refusals. In this work, we introduce Energy Landscape Steering (ELS), a novel, fine-tuning-free framework designed to resolve this challenge through dynamic, inference-time intervention. We train a lightweight, external Energy-Based Model (EBM) to assign high energy to undesirable (false refusal or jailbreak) states and low energy to desirable (helpful response or safe reject) ones. During inference, the EBM maps the LLM’s internal activations to an energy landscape, and we use the gradient of the energy function to steer the hidden states toward low-energy regions in real time. This dynamically guides the model toward desirable behavior without modifying its parameters. By decoupling behavioral control from the model’s core knowledge, ELS provides a flexible and computationally efficient solution. Extensive experiments across diverse models demonstrate its effectiveness: ELS raises compliance on the ORB-H benchmark from 57.3% to 82.6% while maintaining the baseline safety performance. Our work establishes a promising paradigm for building LLMs that simultaneously achieve high safety and low false refusal rates.
Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability
Xiao Liang | Zhong-Zhi Li | Zhenghao Lin | Eric Hanchen Jiang | Hengyuan Zhang | Yelong Shen | Kai-Wei Chang | Ying Nian Wu | Yeyun Gong | Weizhu Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) have demonstrated strong reasoning capabilities through step-by-step chain-of-thought (CoT) reasoning. Nevertheless, at the limits of model capability, CoT often proves insufficient, and its strictly sequential nature constrains test-time scalability. A potential alternative is divide-and-conquer (DAC) reasoning, which decomposes a complex problem into subproblems to facilitate more effective exploration of the solution space. Although promising, our analysis reveals a fundamental misalignment between general-purpose post-training and DAC-style inference, which limits the model’s capacity to fully leverage this potential. To bridge this gap and fully unlock LLMs’ reasoning capabilities on the most challenging tasks, we propose an end-to-end reinforcement learning (RL) framework to enhance their DAC-style reasoning capacity. At each step, the policy decomposes a problem into a group of subproblems, solves them sequentially, and addresses the original problem conditioned on the subproblem solutions, with both decomposition and solution integrated into RL training. Under comparable training settings, our DAC-style framework endows the model with a higher performance ceiling and stronger test-time scalability, surpassing CoT by 8.6% in Pass@1 and 6.3% in Pass@32 on competition-level benchmarks. The code is available at https://github.com/MasterVito/DAC-RL.
Why Are We Moral? An LLM-based Agent Simulation Approach to the Study of Moral Evolution
Zhou Ziheng | Huacong Tang | Mingjie Bi | Wanying He | Fang Sun | Yizhou Sun | Ying Nian Wu | Demetri Terzopoulos | Yipeng Kang | Fangwei Zhong
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The evolution of morality presents a puzzle: natural selection should favor self-interest, yet humans developed moral systems promoting altruism. Traditional approaches must abstract away cognitive processes, leaving open how cognitive factors shape moral evolution. We introduce an LLM-based agent simulation framework that brings cognitive realism to this question: agents with varying moral dispositions perceive, remember, reason, and decide in a simulated prehistoric hunter-gatherer society. This enables us to manipulate factors that traditional models cannot represent—such as moral type observability and communication bandwidth—and to discover emergent cognitive mechanisms from agent interactions. Across 20 runs spanning four settings, we find that cooperation and mutual help are the central driver of evolutionary survival, with universal and reciprocal morality exhibiting the most stable outcomes across conditions while selfishness is strongly disfavoured. Beyond cooperation itself, we further identify cognition as a central mediator—most clearly through a cost of moral judgment that shifts the winning moral type across settings, with a self-purging effect among selfish agents as an additional cognitive pattern. We validate robustness across multiple LLM backbones, architecture ablations, and prompt sensitivity analyses. This work establishes LLM-based simulation as a powerful new paradigm to complement traditional research in evolutionary biology and anthropology, opening new avenues for investigating the complexities of moral and social evolution.
2025
Explore the Reasoning Capability of LLMs in the Chess Testbed
Shu Wang | Lei Ji | Renxi Wang | Wenxiao Zhao | Haokun Liu | Yifan Hou | Ying Nian Wu
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)
Reasoning is a central capability of human intelligence. In recent years, with the advent of large-scale datasets, pretrained large language models have emerged with new capabilities, including reasoning. However, these models still struggle with long-term, complex reasoning tasks, such as playing chess. Based on the observation that expert chess players employ a dual approach combining long-term strategic play with short-term tactical play, along with language explanation, we propose improving the reasoning capability of large language models in chess by integrating annotated strategies and tactics. Specifically, we collect a dataset named MATE, which consists of 1 million chess positions with candidate moves annotated for strategy and tactics. We finetune the LLaMA-3-8B model and compare it against state-of-the-art commercial language models in the task of selecting better chess moves. Our experiments show that our models perform better than GPT, Claude, and Gemini models. We find that language explanations can enhance the reasoning capability of large language models.
On the Analysis and Distillation of Emergent Outlier Properties in Pre-trained Language Models
Tianyang Zhao | Kunwar Yashraj Singh | Srikar Appalaraju | Peng Tang | Ying Nian Wu | Li Erran Li
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
A small subset of dimensions within language Transformers’ representation spaces emerge as “outliers” during pretraining, encoding critical knowledge sparsely. We extend previous findings on emergent outliers to Encoder-Decoder Transformers and instruction-finetuned models, and tackle the problem of distilling a student Transformer from a larger teacher Transformer. Knowledge distillation reduces model size and cost by transferring knowledge from a larger teacher to a smaller student, necessitating a trade-off among representation dimensions. We show that emergent outlier dimensions contribute significantly more to zero-shot performance than non-outlier dimensions. Based on this, we propose the Emergent Outlier Focused Distillation (EOFD) method, which prioritizes critical outlier dimensions in distillation using a weighted MSE loss. We empirically demonstrate that EOFD outperforms state-of-the-art distillation methods and generalizes well across Encoder-only BERT, Decoder-only GPT-2, and Encoder-Decoder T5 architectures.
Value-Spectrum: Quantifying Preferences of Vision-Language Models via Value Decomposition in Social Media Contexts
Jingxuan Li | Yuning Yang | Shengqi Yang | Linfan Zhang | Ying Nian Wu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The recent progress in Vision-Language Models (VLMs) has broadened the scope of multimodal applications. However, evaluations often remain limited to functional tasks, neglecting abstract dimensions such as personality traits and human values. To address this gap, we introduce Value-Spectrum, a novel Visual Question Answering (VQA) benchmark aimed at assessing VLMs based on Schwartz’s value dimensions that capture core human values guiding people’s preferences and actions. We design a VLM agent pipeline to simulate video browsing and construct a vector database comprising over 50,000 short videos from TikTok, YouTube Shorts, and Instagram Reels. These videos span multiple months and cover diverse topics, including family, health, hobbies, society, technology, etc. Benchmarking on Value-Spectrum highlights notable variations in how VLMs handle value-oriented content. Beyond identifying VLMs’ intrinsic preferences, we also explore the ability of VLM agents to adopt specific personas when explicitly prompted, revealing insights into the adaptability of the model in role-playing scenarios. These findings highlight the potential of Value-Spectrum as a comprehensive evaluation set for tracking VLM preferences in value-based tasks and abilities to simulate diverse personas. The complete code and data are available at https://github.com/Jeremyyny/Value-Spectrum.
2021
SCRIPT: Self-Critic PreTraining of Transformers
Erik Nijkamp | Bo Pang | Ying Nian Wu | Caiming Xiong
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
We introduce Self-CRItic Pretraining Transformers (SCRIPT) for representation learning of text. The popular masked language modeling (MLM) pretraining methods like BERT replace some tokens with [MASK] and an encoder is trained to recover them, while ELECTRA trains a discriminator to detect replaced tokens proposed by a generator. In contrast, we train a language model as in MLM and further derive a discriminator or critic on top of the encoder without using any additional parameters. That is, the model itself is a critic. SCRIPT combines MLM training and discriminative training for learning rich representations and compute- and sample-efficiency. We demonstrate improved sample-efficiency in pretraining and enhanced representations evidenced by improved downstream task performance on GLUE and SQuAD over strong baselines. Also, the self-critic scores can be directly used as pseudo-log-likelihood for efficient scoring.
Generative Text Modeling through Short Run Inference
Bo Pang | Erik Nijkamp | Tian Han | Ying Nian Wu
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume
Latent variable models for text, when trained successfully, accurately model the data distribution and capture global semantic and syntactic features of sentences. The prominent approach to train such models is variational autoencoders (VAE). It is nevertheless challenging to train and often results in a trivial local optimum where the latent variable is ignored and its posterior collapses into the prior, an issue known as posterior collapse. Various techniques have been proposed to mitigate this issue. Most of them focus on improving the inference model to yield latent codes of higher quality. The present work proposes a short run dynamics for inference. It is initialized from the prior distribution of the latent variable and then runs a small number (e.g., 20) of Langevin dynamics steps guided by its posterior distribution. The major advantage of our method is that it does not require a separate inference model or assume simple geometry of the posterior distribution, thus rendering an automatic, natural and flexible inference engine. We show that the models trained with short run dynamics more accurately model the data, compared to strong language model and VAE baselines, and exhibit no sign of posterior collapse. Analyses of the latent space show that interpolation in the latent space is able to generate coherent sentences with smooth transition and demonstrate improved classification over strong baselines with latent features from unsupervised pretraining. These results together expose a well-structured latent space of our generative model.
Robust Transfer Learning with Pretrained Language Models through Adapters
Wenjuan Han | Bo Pang | Ying Nian Wu
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
Transfer learning with large pretrained transformer-based language models like BERT has become a dominating approach for most NLP tasks. Simply fine-tuning those large language models on downstream tasks or combining them with task-specific pretraining is often not robust. In particular, the performance varies considerably as the random seed changes or the number of pretraining and/or fine-tuning iterations varies, and the fine-tuned model is vulnerable to adversarial attacks. We propose a simple yet effective adapter-based approach to mitigate these issues. Specifically, we insert small bottleneck layers (i.e., adapters) within each layer of a pretrained model, then fix the pretrained layers and train the adapter layers on the downstream task data, with (1) task-specific unsupervised pretraining and then (2) task-specific supervised training (e.g., classification, sequence labeling). Our experiments demonstrate that such a training scheme leads to improved stability and adversarial robustness in transfer learning to various downstream tasks.
SocAoG: Incremental Graph Parsing for Social Relation Inference in Dialogues
Liang Qiu | Yuan Liang | Yizhou Zhao | Pan Lu | Baolin Peng | Zhou Yu | Ying Nian Wu | Song-Chun Zhu
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Inferring social relations from dialogues is vital for building emotionally intelligent robots to interpret human language better and act accordingly. We model the social network as an And-or Graph, named SocAoG, for the consistency of relations among a group and leveraging attributes as inference cues. Moreover, we formulate a sequential structure prediction task, and propose an 𝛼-𝛽-𝛾 strategy to incrementally parse SocAoG for the dynamic inference upon any incoming utterance: (i) an 𝛼 process predicting attributes and relations conditioned on the semantics of dialogues, (ii) a 𝛽 process updating the social relations based on related attributes, and (iii) a 𝛾 process updating individual’s attributes based on interpersonal social relations. Empirical results on DialogRE and MovieGraph show that our model infers social relations more accurately than the state-of-the-art methods. Moreover, the ablation study shows the three processes complement each other, and the case study demonstrates the dynamic relational inference.
Co-authors
- Kai-Wei Chang 4
- Eric Hanchen Jiang 3
- Xiao Liang (梁霄) 3
- Bo Pang 3
- Weizhu Chen 2
- Yeyun Gong 2
- Xinfeng Li 2
- Erik Nijkamp 2
- Yelong Shen 2
- Yizhou Sun 2
- Srikar Appalaraju 1
- Mingjie Bi 1
- Wei Dong 1
- Boyan Duan 1
- Ranjie Duan 1
- Tian Han 1
- Wenjuan Han 1
- Wanying He 1
- Yifan Hou 1
- Lei Ji 1
- Yipeng Kang 1
- Jingxuan Li 1
- Levina Li 1
- Li Erran Li 1
- Zhong-Zhi Li 1
- Yuan Liang 1
- Zhenghao Lin 1
- Haokun Liu 1
- Run Liu 1
- Pan Lu 1
- Shuai Lu 1
- Weixuan Ou 1
- Shengyuan Pang 1
- Baolin Peng 1
- Liang Qiu 1
- Kunwar Yashraj Singh 1
- Fang Sun 1
- Huacong Tang 1
- Peng Tang 1
- Demetri Terzopoulos 1
- Frank Wan 1
- Guancheng Wan 1
- Renxi Wang 1
- Shu Wang 1
- Wei Wang 1
- XiaoFeng Wang 1
- Yaoxiang Wang 1
- Yuchen Wu 1
- Caiming Xiong 1
- Mao Yang 1
- Shengqi Yang 1
- Yuning Yang 1
- Sophia Yin 1
- Zhou Yu 1
- Hengyuan Zhang 1
- Linfan Zhang 1
- Tianyang Zhao 1
- Wenxiao Zhao 1
- Yizhou Zhao 1
- Fangwei Zhong 1
- Song-Chun Zhu 1
- Zhou Ziheng 1