Wenxuan Zhang

2026

Language of Thought Shapes Output Diversity in Large Language Models
Shaoyang Xu | Wenxuan Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Output diversity is crucial for Large Language Models as it underpins pluralism and creativity. In this work, we reveal that controlling the language used during model thinking (the *language of thought*) provides a novel and structural source of output diversity. Our preliminary study shows that different thinking languages occupy distinct regions in a model’s thinking space. Based on this observation, we study two repeated sampling strategies under multilingual thinking, *Single-Language Sampling* and *Mixed-Language Sampling*, and conduct diversity evaluation on outputs that are controlled to be in English, regardless of the thinking language used. Across extensive experiments, we demonstrate that switching the thinking language from English to non-English languages consistently increases output diversity, with a clear and consistent positive correlation such that languages farther from English in the thinking space yield larger gains. We further show that aggregating samples across multiple thinking languages yields additional improvements through compositional effects, and that scaling sampling with linguistic heterogeneity expands the model’s diversity ceiling. Finally, we show that these findings translate into practical benefits in pluralistic alignment scenarios, leading to broader coverage of cultural knowledge and value orientations in LLM outputs. Our code is publicly available at https://github.com/iNLP-Lab/Multilingual-LoT-Diversity.

DR-Arena: an Automated Evaluation Framework for Deep Research Agents
Yiwen Gao | Ruochen Zhao | Yang Deng | Wenxuan Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

As Large Language Models (LLMs) increasingly operate as Deep Research (DR) Agents capable of autonomous investigation and information synthesis, reliable evaluation of their task performance has become a critical bottleneck. Current benchmarks predominantly rely on static datasets, which suffer from several limitations: limited task generality, temporal misalignment, and data contamination. To address these, we introduce DR-Arena, a fully automated evaluation framework that pushes DR agents to their capability limits through dynamic investigation. DR-Arena constructs real-time Information Trees from fresh web trends to ensure the evaluation rubric is synchronized with the live world state, and employs an automated Examiner to generate structured tasks testing two orthogonal capabilities: Deep reasoning and Wide coverage. DR-Arena further adopts an Adaptive Evolvement Loop, a state-machine controller that dynamically escalates task complexity based on real-time performance, demanding deeper deduction or wider aggregation until a decisive capability boundary emerges. Experiments with six advanced DR agents demonstrate that DR-Arena achieves a Spearman correlation of 0.94 with the LMSYS Search Arena leaderboard. This represents state-of-the-art alignment with human preferences without any manual effort, validating DR-Arena as a reliable alternative to costly human adjudication.
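
For readers unfamiliar with the agreement metric quoted in the abstract above, the following is a minimal sketch of how a Spearman rank correlation between two sets of scores can be computed. It is not the authors' code: the function names `ranks` and `spearman` and all score values are hypothetical placeholders chosen for illustration, and none of the numbers come from the paper.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks (smallest value gets rank 1), averaging over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of tied values starting at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for six agents (invented for illustration only).
framework_scores = [71.0, 64.0, 58.0, 49.0, 45.0, 30.0]
leaderboard_scores = [1185.0, 1210.0, 1150.0, 1120.0, 1050.0, 1100.0]
rho = spearman(framework_scores, leaderboard_scores)
```

A value near 1 means the two score lists order the agents almost identically; in the invented data above, two adjacent pairs of agents are swapped between the lists, so `rho` falls somewhat below 1.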
