Nuo Chen
2026
XtraGPT: Context-Aware and Controllable Academic Paper Revision via Human-AI Collaboration
Nuo Chen | Andre Lin HuiKai | Jiaying Wu | Junyi Hou | Zining Zhang | Qian Wang | Xidong Wang | Bingsheng He
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite the growing adoption of large language models (LLMs) in academic workflows, their capabilities remain limited in supporting high-quality scientific writing. Most existing systems are designed for general-purpose scientific text generation and fail to meet the sophisticated demands of research communication beyond surface-level polishing, for example, maintaining conceptual coherence across sections. Furthermore, academic writing is inherently iterative and revision-driven, a process that is not well supported by direct prompting-based paradigms. To address these scenarios, we propose a human-AI collaboration framework for academic paper revision, centered on criteria-guided intent alignment and context-aware modeling. To validate the framework, we curate a dataset of 7,000 research papers from top-tier venues, annotated with 140,000 instruction–response pairs that reflect realistic, section-level scientific revisions. We instantiate the framework in XtraGPT, the first suite of open-source LLMs (1.5B to 14B parameters) specifically fine-tuned for context-aware academic paper revision. Extensive experiments show that XtraGPT significantly outperforms same-scale baselines and rivals the quality of proprietary counterparts. Both automated preference assessments and human evaluations confirm the effectiveness of XtraGPT in improving scientific drafts. Our code and models are available at https://github.com/Xtra-Computing/XtraGPT and https://huggingface.co/collections/Xtra-Computing/xtragpt.
MultiFinBen: Benchmarking Large Language Models for Multilingual and Multimodal Financial Application
Xueqing Peng | Lingfei Qian | Yan Wang | Ruoyu Xiang | Yueru He | Yang Ren | Mingyang Jiang | Vincent Jim Zhang | Yuqing Guo | Jeff Zhao | Huan He | Yi Han | Yun Feng | Yuechen Jiang | Yupeng Cao | Haohang Li | Yangyang Yu | Xiaoyu Wang | Penglei Gao | Shengyuan Lin | Keyi Wang | Shanshan Yang | Yilun Zhao | Zhiwei Liu | Peng Lu | Jerry Huang | Suyuchen Wang | Triantafillos Papadopoulos | Polydoros Giannouris | Efstathia Soufleri | Nuo Chen | Zhiyang Deng | Heming Fu | Yijia Zhao | Mingquan Lin | Meikang Qiu | Kaleb E Smith | Arman Cohan | Xiao-Yang Liu | Jimin Huang | Guojun Xiong | Alejandro Lopez-Lira | Xi Chen | Junichi Tsujii | Jian-Yun Nie | Sophia Ananiadou | Qianqian Xie
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Real-world financial analysis involves information across multiple languages and modalities, from reports and news to scanned filings and meeting recordings. Yet most existing evaluations of LLMs in finance remain text-only, monolingual, and largely saturated by current models. To bridge these gaps, we present MultiFinBen, the first expert-annotated multilingual (five languages) and multimodal (text, vision, audio) benchmark for evaluating LLMs in realistic financial contexts. MultiFinBen introduces two new task families: multilingual financial reasoning, which tests cross-lingual evidence integration from filings and news, and financial OCR, which extracts structured text from scanned documents containing tables and charts. Rather than aggregating all available datasets, we apply a structured, difficulty-aware selection based on advanced model performance, ensuring balanced challenge and removing redundant tasks. Evaluating 21 leading LLMs shows that even frontier multimodal models like GPT-4o achieve only 46.01% overall, stronger on vision and audio but dropping sharply in multilingual settings. These findings expose persistent limitations in multilingual, multimodal, and expert-level financial reasoning. All datasets, evaluation scripts, and leaderboards are publicly released.
Diversity Collapse in Multi-Agent LLM Systems: Structural Coupling and Collective Failure in Open-Ended Idea Generation
Nuo Chen | Yicheng Tong | Yuzhe Yang | Yufei He | Xueyi Zhang | Zou Qingyun | Qian Wang | Bingsheng He
Findings of the Association for Computational Linguistics: ACL 2026
Multi-agent systems (MAS) are increasingly used for open-ended idea generation, driven by the expectation that collective interaction will broaden the exploration diversity. However, when and why such collaboration truly expands the solution space remains unclear. We present a systematic empirical study of diversity in MAS-based ideation across three bottom-up levels: model intelligence, agent cognition, and system dynamics. At the model level, we identify a compute efficiency paradox, where stronger, highly aligned models yield diminishing marginal diversity despite higher per-sample quality. At the cognition level, authority-driven dynamics suppress semantic diversity compared to junior-dominated groups. At the system level, group-size scaling yields diminishing returns and dense communication topologies accelerate premature convergence. We characterize these outcomes as collective failures emerging from structural coupling, a process where interaction inadvertently contracts agent exploration and triggers diversity collapse. Our analysis shows that this collapse arises primarily from the interaction structure rather than inherent model insufficiency, highlighting the importance of preserving independence and disagreement when designing MAS for creative tasks. Our code is available at https://github.com/Xtra-Computing/MAS_Diversity.
2025
MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria
Wentao Ge | Shunian Chen | Hardy Chen | Nuo Chen | Junying Chen | Zhihong Chen | Wenya Xie | Shuo Yan | Chenghao Zhu | Ziyue Lin | Dingjie Song | Xidong Wang | Anningzhe Gao | Zhang Zhiyi | Jianquan Li | Xiang Wan | Benyou Wang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Multimodal large language models (MLLMs) have broadened the scope of AI applications. Existing automatic evaluation methodologies for MLLMs are mainly limited to evaluating objective queries without considering real-world user experiences, inadequately addressing the nuances of creative and associative multimodal tasks. However, the open-ended and subjective nature of such tasks poses a significant challenge to the evaluation methodology, as it is difficult to define ground-truth answers for them. To this end, we propose a new evaluation paradigm for MLLMs, which evaluates MLLMs with per-sample criteria using a potent MLLM as the judge. To validate the feasibility and effectiveness of this paradigm, we design a benchmark, dubbed MLLM-Bench, by curating the evaluation samples across six comprehensive cognitive levels. We benchmark 26 popular MLLMs in a pairwise-comparison fashion, showing diverse performance across models. Moreover, the validity of our benchmark manifests itself in reaching 88.02% agreement with human evaluation. We contend that the proposed paradigm explores the potential of MLLMs as effective evaluation tools with the help of per-sample criteria.
Is Your LLM Outdated? A Deep Look at Temporal Generalization
Chenghao Zhu | Nuo Chen | Yufei Gao | Yunyi Zhang | Prayag Tiwari | Benyou Wang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
The rapid advancement of Large Language Models (LLMs) has led to the development of benchmarks that consider temporal dynamics. However, there remains a gap in understanding how well these models can generalize across temporal contexts due to the inherent dynamic nature of language and information. This paper introduces the concept of temporal generalization in LLMs, including bias in past and future generalizations. Then we introduce FreshBench, a new evaluation framework that employs fresh text and event prediction for assessing LLMs’ temporal adaptability, ensuring that the evaluation process is free from data leakage and subjective bias. The experiment shows significant temporal biases and a decline in performance over time.
MegaAgent: A Large-Scale Autonomous LLM-based Multi-Agent System Without Predefined SOPs
Qian Wang | Tianyu Wang | Zhenheng Tang | Qinbin Li | Nuo Chen | Jingsheng Liang | Bingsheng He
Findings of the Association for Computational Linguistics: ACL 2025
LLM-based multi-agent systems (MAS) have shown promise in tackling complex tasks. However, existing solutions often suffer from limited agent coordination and heavy reliance on predefined Standard Operating Procedures (SOPs), which demand extensive human input. To address these limitations, we propose MegaAgent, a large-scale autonomous LLM-based multi-agent system. MegaAgent generates agents based on task complexity and enables dynamic task decomposition, parallel execution, efficient communication, and comprehensive system monitoring of agents. In evaluations, MegaAgent demonstrates exceptional performance, successfully developing a Gobang game within 800 seconds and scaling up to 590 agents in a national policy simulation to generate multi-domain policies. It significantly outperforms existing systems, such as MetaGPT, in both task completion efficiency and scalability. By eliminating the need for predefined SOPs, MegaAgent demonstrates exceptional scalability and autonomy, setting a foundation for advancing true autonomy in MAS.
DRBO: Mitigating the Bottleneck Effect via Dynamic Reward Balancing in Multi-reward LLM Optimization
Nuo Chen | Yufei Gao | Yongnan Jin | Yan Hu | Anningzhe Gao | Lingyong Yan | Benyou Wang
Findings of the Association for Computational Linguistics: EMNLP 2025
In the current landscape of large language models (LLMs), many evaluation metrics have been developed and used as rewards during training to improve specific metrics. However, balancing these metrics and dynamically adjusting reward weights remains challenging, as current approaches often fail to enhance weaker metrics. To address this, we empirically propose a Dynamic Reward Balancing Optimization framework, DRBO, to mitigate the “bottleneck effect” by measuring performance, adjusting reward weights to prioritize weaker metrics, and optimizing the model via reinforcement learning. We apply DRBO to both single-task and multi-type task scenarios, validating its effectiveness in generation with citations and online shopping conversation tasks. The results demonstrate improved overall performance and balanced optimization across multiple metrics, effectively overcoming the diversity and complexity inherent in LLMs. Our code is available at https://github.com/NuoJohnChen/DRBO.
2024
CryptoTrade: A Reflective LLM-based Agent to Guide Zero-shot Cryptocurrency Trading
Yuan Li | Bingqiao Luo | Qian Wang | Nuo Chen | Xu Liu | Bingsheng He
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
The utilization of Large Language Models (LLMs) in financial trading has primarily been concentrated within the stock market, aiding in economic and financial decisions. Yet, the unique opportunities presented by the cryptocurrency market, noted for its on-chain data’s transparency and the critical influence of off-chain signals like news, remain largely untapped by LLMs. This work aims to bridge the gap by developing an LLM-based trading agent, CryptoTrade, which uniquely combines the analysis of on-chain and off-chain data. This approach leverages the transparency and immutability of on-chain data, as well as the timeliness and influence of off-chain signals, providing a comprehensive overview of the cryptocurrency market. CryptoTrade incorporates a reflective mechanism specifically engineered to refine its daily trading decisions by analyzing the outcomes of prior trading decisions. This research makes two significant contributions. Firstly, it broadens the applicability of LLMs to the domain of cryptocurrency trading. Secondly, it establishes a benchmark for cryptocurrency trading strategies. Through extensive experiments, CryptoTrade has demonstrated superior performance in maximizing returns compared to time-series baselines, but not compared to traditional trading signals, across various cryptocurrencies and market conditions. Our code and data are available at https://github.com/Xtra-Computing/CryptoTrade
2023
When Gradient Descent Meets Derivative-Free Optimization: A Match Made in Black-Box Scenario
Chengcheng Han | Liqing Cui | Renyu Zhu | Jianing Wang | Nuo Chen | Qiushi Sun | Xiang Li | Ming Gao
Findings of the Association for Computational Linguistics: ACL 2023
Large pre-trained language models (PLMs) have garnered significant attention for their versatility and potential for solving a wide spectrum of natural language processing (NLP) tasks. However, the cost of running these PLMs may be prohibitive. Furthermore, PLMs such as GPT-3 may not be open-sourced due to commercial considerations and potential risks of misuse. The parameters and gradients of PLMs are unavailable in this scenario. To solve the issue, black-box tuning has been proposed, which utilizes derivative-free optimization (DFO), instead of gradient descent, for training task-specific continuous prompts. However, these gradient-free methods still exhibit a significant gap compared to gradient-based methods. In this paper, we introduce gradient descent into the black-box tuning scenario through knowledge distillation. Furthermore, we propose a novel method, GDFO, which integrates gradient descent and derivative-free optimization to optimize task-specific continuous prompts in a harmonized manner. Experimental results show that GDFO can achieve significant performance gains over previous state-of-the-art methods.
Co-authors
- Bingsheng He 4
- Qian Wang 4
- Benyou Wang 3
- Anningzhe Gao 2
- Yufei Gao 2
- Xidong Wang 2
- Chenghao Zhu 2
- Sophia Ananiadou 1
- Yupeng Cao 1
- Xi Chen 1
- Shunian Chen 1
- Hardy Chen 1
- Junying Chen 1
- Zhihong Chen 1
- Arman Cohan 1
- Liqing Cui 1
- Zhiyang Deng 1
- Yun Feng 1
- Heming Fu 1
- Penglei Gao 1
- Ming Gao 1
- Wentao Ge 1
- Polydoros Giannouris 1
- Yuqing Guo 1
- Yi Han 1
- Chengcheng Han 1
- Yueru He 1
- Huan He 1
- Yufei He 1
- Junyi Hou 1
- Yan Hu 1
- Jerry Huang 1
- Jimin Huang 1
- Andre Lin HuiKai 1
- Mingyang Jiang 1
- Yuechen Jiang 1
- Yongnan Jin 1
- Haohang Li 1
- Jianquan Li 1
- Qinbin Li 1
- Yuan Li 1
- Xiang Li 1
- Jingsheng Liang 1
- Shengyuan Lin 1
- Mingquan Lin 1
- Ziyue Lin 1
- Zhiwei Liu 1
- Xiao-Yang Liu 1
- Xu Liu 1
- Alejandro Lopez-Lira 1
- Peng Lu 1
- Bingqiao Luo 1
- Jian-Yun Nie 1
- Triantafillos Papadopoulos 1
- Xueqing Peng 1
- Lingfei Qian 1
- Zou Qingyun 1
- Meikang Qiu 1
- Yang Ren 1
- Kaleb E. Smith 1
- Dingjie Song 1
- Efstathia Soufleri 1
- Qiushi Sun 1
- Zhenheng Tang 1
- Prayag Tiwari 1
- Yicheng Tong 1
- Jun’ichi Tsujii 1
- Xiang Wan 1
- Yan Wang 1
- Xiaoyu Wang 1
- Keyi Wang 1
- Suyuchen Wang 1
- Tianyu Wang 1
- Jianing Wang 1
- Jiaying Wu 1
- Ruoyu Xiang 1
- Qianqian Xie 1
- Wenya Xie 1
- Guojun Xiong 1
- Shuo Yan 1
- Lingyong Yan 1
- Shanshan Yang 1
- Yuzhe Yang 1
- Yangyang Yu 1
- Zining Zhang 1
- Vincent Jim Zhang 1
- Xueyi Zhang 1
- Yunyi Zhang 1
- Jeff Zhao 1
- Yilun Zhao 1
- Yijia Zhao 1
- Zhang Zhiyi 1
- Renyu Zhu 1