Yuan Wang
2026
Towards Proactive Personalization through Profile Customization for Individual Users in Dialogues
Xiaotian Zhang | Yuan Wang | Ruizhe Chen | Zeya Wang | Runchen Hou | Zuozhu Liu
Findings of the Association for Computational Linguistics: ACL 2026
The deployment of Large Language Models (LLMs) in interactive systems necessitates a deep alignment with the nuanced and dynamic preferences of individual users. Current alignment techniques predominantly address universal human values or static, single-turn preferences, thereby failing to address the critical needs of long-term personalization and the initial user cold-start problem. To bridge this gap, we propose PersonalAgent, a novel user-centric lifelong agent designed to continuously infer and adapt to user preferences. PersonalAgent constructs and dynamically refines a unified user profile by decomposing dialogues into single-turn interactions, framing preference inference as a sequential decision-making task. Experiments show that PersonalAgent achieves superior performance over strong prompt-based and policy optimization baselines, not only in idealized but also in noisy conversational contexts, while preserving cross-session preference consistency. Furthermore, human evaluation confirms that PersonalAgent excels at capturing user preferences naturally and coherently. Our findings underscore the importance of lifelong personalization for developing more inclusive and adaptive conversational agents. Our code and ALOE-Unseen dataset are released here.
Act as you think: Reinforcing Consistent Reasoning in Medical Visual Question Answering
Songtao Jiang | Yuan Wang | Ruizhe Chen | Yan Zhang | Ruilin Luo | Bohan Lei | Yeying Jin | Sibo Song | ZhiBo Yang | Jimeng Sun | Jian Wu | Zuozhu Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While reinforcement learning from verifiable rewards (RLVR) has been proven highly effective for enhancing reasoning, its application to medical visual question answering (Med-VQA) is hampered by models producing reasoning inconsistent with either the visual evidence or the final answer. Our analysis reveals a critical flaw in RLVR training: it paradoxically encourages models to disregard visual evidence and generate answers that contradict their own reasoning. This degradation is most pronounced in specialized medical modalities (e.g., Fundus, Ultrasound) where base VLMs lack robust understanding, a failure we attribute to a flawed reward mechanism exacerbated by the scarcity of diverse training data. To tackle this, we introduce Med-Zero-17K, a large-scale dataset spanning over 30 modalities and 24 clinically relevant tasks, and the Multi-Consistency Reward (MCR) framework, which explicitly rewards both perceptual grounding and logical coherence. Extensive experiments validate our approach: integrating MCR into the RLVR framework delivers robust performance gains. This success stems from our crucial finding that rewarding internal consistency is significantly more effective than attempting to judge reasoning correctness. Furthermore, MCR proves highly versatile, exhibiting strong generalization across diverse VLM backbones, compatibility with RL algorithms like GRPO and DPO, and extending its effectiveness to 3D VQA tasks and R1-style training paradigms. Code and dataset will be released.
Data Efficient RLVR via Off-Policy Influence Guidance
Erle Zhu | Dazhi Jiang | Yuan Wang | Xujun Li | Jiale Cheng | Yuxian Gu | Yilin Niu | Aohan Zeng | Jie Tang | Minlie Huang | Hongning Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Data selection is a critical aspect of Reinforcement Learning with Verifiable Rewards (RLVR) for enhancing the reasoning capabilities of large language models (LLMs). Current data selection methods are largely heuristic-based, lacking theoretical guarantees and generalizability. This work proposes a theoretically-grounded approach using influence functions to estimate the contribution of each data point to the learning objective. To overcome the prohibitive computational cost of policy rollouts required for online influence estimation, we introduce an off-policy influence estimation method that efficiently approximates data influence using pre-collected offline trajectories. Furthermore, to manage the high-dimensional gradients of LLMs, we employ sparse random projection to reduce dimensionality and improve storage and computation efficiency. Leveraging these techniques, we develop Curriculum RL with Off-Policy Influence guidance (CROPI), a multi-stage RL framework that iteratively selects the most influential data for the current policy. Experiments on models up to 7B parameters demonstrate that CROPI significantly accelerates training. On a 1.5B model, it achieves a 2.66x step-level acceleration while using only 10% of the data per stage compared to full-dataset training. Our results highlight the substantial potential of influence-based data selection for efficient RLVR.
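The sparse random projection step mentioned in this abstract can be illustrated with a minimal sketch. It assumes a common first-order simplification in which a training example's influence is scored by the inner product between its gradient and a target gradient. This is not CROPI's actual estimator, and every name below (`make_sparse_projection`, `project`, `rank_by_influence`) is hypothetical.

```python
import math
import random


def make_sparse_projection(in_dim, out_dim, density=0.1, seed=0):
    """Build a sparse random projection as a list of sparse rows.

    Each row is a list of (column_index, value) pairs. Non-zero entries are
    +/- scale with scale = 1 / sqrt(density * out_dim), which makes projected
    inner products unbiased estimates of the original inner products.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(density * out_dim)
    rows = []
    for _ in range(out_dim):
        row = []
        for j in range(in_dim):
            if rng.random() < density:
                sign = 1.0 if rng.random() < 0.5 else -1.0
                row.append((j, sign * scale))
        rows.append(row)
    return rows


def project(rows, vec):
    """Apply the sparse projection to a dense vector."""
    return [sum(value * vec[j] for j, value in row) for row in rows]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rank_by_influence(train_grads, target_grad, rows):
    """Rank training examples by projected-gradient alignment with a target.

    Assumption: influence is approximated by a gradient inner product.
    Returns example indices, most influential first.
    """
    target = project(rows, target_grad)
    scores = [dot(project(rows, g), target) for g in train_grads]
    return sorted(range(len(train_grads)), key=lambda i: scores[i], reverse=True)
```

Because only the low-dimensional projections need to be stored, per-example gradients from pre-collected trajectories could be compared against a target gradient without keeping the full-size vectors.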
2025
Evaluating Fairness in Large Vision-Language Models Across Diverse Demographic Attributes and Prompts
Xuyang Wu | Yuan Wang | Hsin-Tai Wu | Zhiqiang Tao | Yi Fang
Findings of the Association for Computational Linguistics: EMNLP 2025
Large vision-language models (LVLMs) have recently achieved significant progress, demonstrating strong capabilities in open-world visual understanding. However, it is not yet clear how LVLMs address demographic biases in real life, especially the disparities across attributes such as gender, skin tone, age and race. In this paper, we empirically investigate visual fairness in several mainstream LVLMs by auditing their performance disparities across demographic attributes using public fairness benchmark datasets (e.g., FACET, UTKFace). Our fairness evaluation framework employs direct and single-choice question prompts on visual question-answering/classification tasks. Despite advancements in visual understanding, our zero-shot prompting results show that both open-source and closed-source LVLMs continue to exhibit fairness issues across different prompts and demographic groups. Furthermore, we propose a potential multi-modal Chain-of-thought (CoT) based strategy for unfairness mitigation, applicable to both open-source and closed-source LVLMs. This approach enhances transparency and offers a scalable solution for addressing fairness, providing a solid foundation for future research and practical efforts in unfairness mitigation. The dataset and code used in this study are publicly available at this GitHub Repository.
ComRAG: Retrieval-Augmented Generation with Dynamic Vector Stores for Real-time Community Question Answering in Industry
Qinwen Chen | Wenbiao Tao | Zhiwei Zhu | Mingfan Xi | Liangzhong Guo | Yuan Wang | Wei Wang | Yunshi Lan
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
Community Question Answering (CQA) platforms can be regarded as important community knowledge bases, but effectively leveraging historical interactions and domain knowledge in real time remains a challenge. Existing methods often underutilize external knowledge, fail to incorporate dynamic historical QA context, or lack memory mechanisms suited for industrial deployment. We propose ComRAG, a retrieval-augmented generation framework for real-time industrial CQA that integrates static knowledge with dynamic historical QA pairs via a centroid-based memory mechanism designed for retrieval, generation, and efficient storage. Evaluated on three industrial CQA datasets, ComRAG consistently outperforms all baselines, achieving up to 25.9% improvement in vector similarity, reducing latency by 8.7%–23.3%, and lowering chunk growth from 20.23% to 2.06% over iterations.
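One plausible reading of a centroid-based memory is sketched below: a new QA embedding is merged into the nearest cluster when it is similar enough to that cluster's centroid, and only otherwise opens a new cluster, which keeps storage growth bounded. The abstract does not specify the mechanism at this level of detail, so the class name `CentroidMemory`, the cosine threshold, and the running-mean update are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors (0.0 if either is zero)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


class CentroidMemory:
    """Hypothetical store of historical QA pairs grouped around centroids."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold  # assumed similarity cut-off for merging
        self.centroids = []         # running mean embedding per cluster
        self.members = []           # QA pairs per cluster

    def add(self, embedding, qa_pair):
        """Insert a QA pair and return the index of the cluster it joined."""
        best, best_sim = None, -1.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(embedding, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= self.threshold:
            n = len(self.members[best])
            # Update the centroid as a running mean instead of adding a chunk.
            self.centroids[best] = [
                (c * n + e) / (n + 1)
                for c, e in zip(self.centroids[best], embedding)
            ]
            self.members[best].append(qa_pair)
            return best
        self.centroids.append(list(embedding))
        self.members.append([qa_pair])
        return len(self.centroids) - 1

    def retrieve(self, embedding):
        """Return the QA pairs of the cluster closest to the query embedding."""
        if not self.centroids:
            return []
        best = max(
            range(len(self.centroids)),
            key=lambda i: cosine(embedding, self.centroids[i]),
        )
        return self.members[best]
```

In this sketch, near-duplicate questions update an existing centroid rather than adding a new one, which is one way the number of stored clusters could grow more slowly than the number of questions.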
2024
MedCoT: Medical Chain of Thought via Hierarchical Expert
Jiaxiang Liu | Yuan Wang | Jiawei Du | Joey Tianyi Zhou | Zuozhu Liu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Artificial intelligence has advanced in Medical Visual Question Answering (Med-VQA), but prevalent research tends to focus on the accuracy of the answers, often overlooking the reasoning paths and interpretability, which are crucial in clinical settings. Besides, current Med-VQA algorithms, typically reliant on single models, lack the robustness needed for real-world medical diagnostics, which usually require collaborative expert evaluation. To address these shortcomings, this paper presents MedCoT, a novel hierarchical expert verification reasoning chain method designed to enhance interpretability and accuracy in biomedical imaging inquiries. MedCoT is predicated on two principles: the necessity for explicit reasoning paths in Med-VQA and the requirement for multi-expert review to formulate accurate conclusions. The methodology involves an Initial Specialist proposing diagnostic rationales, followed by a Follow-up Specialist who validates these rationales, and finally, a consensus is reached through a vote among a sparse Mixture of Experts within the locally deployed Diagnostic Specialist, which then provides the definitive diagnosis. Experimental evaluations on four standard Med-VQA datasets demonstrate that MedCoT surpasses existing state-of-the-art approaches, providing significant improvements in performance and interpretability.
Do Large Language Models Rank Fairly? An Empirical Study on the Fairness of LLMs as Rankers
Yuan Wang | Xuyang Wu | Hsin-Tai Wu | Zhiqiang Tao | Yi Fang
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
The integration of Large Language Models (LLMs) in information retrieval has prompted a critical reevaluation of fairness in text-ranking models. LLMs, such as GPT models and Llama2, have shown effectiveness in natural language understanding tasks, and prior works such as RankGPT have demonstrated that LLMs outperform traditional ranking models on the ranking task. However, their fairness remains largely unexplored. This paper presents an empirical study evaluating these LLMs using the TREC Fair Ranking dataset, focusing on the representation of binary protected attributes such as gender and geographic location, which are historically underrepresented in search outcomes. Our analysis delves into how these LLMs handle queries and documents related to these attributes, aiming to uncover biases in their ranking algorithms. We assess fairness from both user and content perspectives, contributing an empirical benchmark for evaluating LLMs as fair rankers.
2022
Exploring Dual Encoder Architectures for Question Answering
Zhe Dong | Jianmo Ni | Dan Bikel | Enrique Alfonseca | Yuan Wang | Chen Qu | Imed Zitouni
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Dual encoders have been used for question-answering (QA) and information retrieval (IR) tasks with good results. There are two major types of dual encoders: Siamese Dual Encoders (SDE), with parameters shared across two encoders, and Asymmetric Dual Encoders (ADE), with two distinctly parameterized encoders. In this work, we explore the dual encoder architectures for QA retrieval tasks. By evaluating on MS MARCO, open domain NQ, and the MultiReQA benchmarks, we show that SDE performs significantly better than ADE. We further propose three different improved versions of ADEs. Based on the evaluation of QA retrieval tasks and direct analysis of the embeddings, we demonstrate that sharing parameters in projection layers enables ADEs to perform competitively with SDEs.
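The difference between the parameter-sharing schemes described in this abstract can be shown with a toy sketch. It assumes each encoder is a "tower" followed by a projection layer, both reduced here to plain matrix multiplications. The class and function names are hypothetical and the sketch does not reproduce the paper's models.

```python
def linear(weights, vec):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


class DualEncoder:
    """Toy dual encoder: each side is a tower followed by a projection."""

    def __init__(self, q_tower, a_tower, q_proj, a_proj):
        self.q_tower, self.a_tower = q_tower, a_tower
        self.q_proj, self.a_proj = q_proj, a_proj

    def encode_question(self, x):
        return linear(self.q_proj, linear(self.q_tower, x))

    def encode_answer(self, x):
        return linear(self.a_proj, linear(self.a_tower, x))

    def score(self, question, answer):
        """Dot-product relevance score between the two embeddings."""
        q = self.encode_question(question)
        a = self.encode_answer(answer)
        return sum(x * y for x, y in zip(q, a))


def siamese(tower, proj):
    """SDE: both sides share every parameter."""
    return DualEncoder(tower, tower, proj, proj)


def asymmetric(q_tower, a_tower, q_proj, a_proj):
    """ADE: the two sides have distinct towers and projections."""
    return DualEncoder(q_tower, a_tower, q_proj, a_proj)


def asymmetric_shared_projection(q_tower, a_tower, proj):
    """ADE variant: distinct towers, but one shared projection layer."""
    return DualEncoder(q_tower, a_tower, proj, proj)
```

With a shared projection, both towers are mapped through the same final layer, so the question and answer embeddings are at least produced by a common output transformation even though the towers differ.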
2019
Toward Automated Content Feedback Generation for Non-native Spontaneous Speech
Su-Youn Yoon | Ching-Ni Hsieh | Klaus Zechner | Matthew Mulholland | Yuan Wang | Nitin Madnani
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications
In this study, we developed an automated algorithm to provide feedback about the specific content of non-native English speakers’ spoken responses. The responses were spontaneous speech, elicited using integrated tasks where the language learners listened to and/or read passages and integrated the core content in their spoken responses. Our models detected the absence of key points considered to be important in a spoken response to a particular test question, based on two different models: (a) a model using word-embedding based content features and (b) a state-of-the-art short response scoring engine using traditional n-gram based features. Both models achieved a substantially improved performance over the majority baseline, and the combination of the two models achieved a significant further improvement. In particular, the models were robust to automated speech recognition (ASR) errors, and performance based on the ASR word hypotheses was comparable to that based on manual transcriptions. The accuracy and F-score of the best model for the questions included in the train set were 0.80 and 0.68, respectively. Finally, we discussed possible approaches to generating targeted feedback about the content of a language learner’s response, based on automatically detected missing key points.
2016
Predicting Restaurant Consumption Level through Social Media Footprints
Yang Xiao | Yuan Wang | Hangyu Mao | Zhen Xiao
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
Accurate prediction of user attributes from social media is valuable for both social science analysis and consumer targeting. In this paper, we propose a systematic method to leverage users' online social media content for predicting offline restaurant consumption level. We utilize the social login as a bridge and construct a dataset of 8,844 users who have been linked across Dianping (similar to Yelp) and Sina Weibo. More specifically, we construct consumption level ground truth based on users' self-reported spending. We build predictive models using both raw features and, especially, latent features, such as topic distributions and celebrity clusters. The employed methods demonstrate that online social media content has strong predictive power for offline spending. Finally, combined with qualitative feature analysis, we present the differences in word usage, topic interests and following behavior between different consumption level groups.
Co-authors
- Zuozhu Liu 3
- Ruizhe Chen 2
- Zhiqiang Tao 2
- Xuyang Wu 2
- Hsin-Tai Wu 2
- Yang Xiao 2
- Zhen Xiao 2
- Enrique Alfonseca 1
- Daniel M. Bikel 1
- Qinwen Chen 1
- Jiale Cheng 1
- Zhe Dong 1
- Jiawei Du 1
- Yi Fang 1
- Yi Fang 1
- Yuxian Gu 1
- Liangzhong Guo 1
- Runchen Hou 1
- Ching-Ni Hsieh 1
- Minlie Huang 1
- Songtao Jiang 1
- Dazhi Jiang 1
- Yeying Jin 1
- Yunshi Lan 1
- Bohan Lei 1
- Xujun Li 1
- Jiaxiang Liu 1
- Ruilin Luo 1
- Chao Ma 1
- Nitin Madnani 1
- Hangyu Mao 1
- Matthew Mulholland 1
- Jianmo Ni 1
- Yilin Niu 1
- Chen Qu 1
- Sibo Song 1
- Jimeng Sun 1
- Jie Tang 1
- Wenbiao Tao 1
- Zeya Wang 1
- Wei Wang 1
- Hongning Wang 1
- Jian Wu 1
- Mingfan Xi 1
- ZhiBo Yang 1
- Su-Youn Yoon 1
- Klaus Zechner 1
- Aohan Zeng 1
- Xiaotian Zhang 1
- Yan Zhang 1
- Joey Tianyi Zhou 1
- Zhiwei Zhu 1
- Erle Zhu 1
- Imed Zitouni 1