Bing Yin
2026
Mitigating Lost in Multi-turn Conversation via Curriculum RL with Verifiable Accuracy and Abstention Rewards
Ming Li | Pei Chen | Zhenhao Zhang | Tao Yang | Xinyang Zhang | Han Li | Tianyu Cao | Ming Zeng | Zhuofeng Wu | Meng Jiang | Huasheng Li | Lihong Li | Bing Yin
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models demonstrate strong capabilities in single-turn instruction following but suffer from Lost-in-Conversation (LiC), a degradation in performance as information is revealed progressively in multi-turn settings. Motivated by the current progress on Reinforcement Learning with Verifiable Rewards (RLVR), we propose Curriculum Reinforcement Learning with Verifiable Accuracy and Abstention Rewards (RLAAR), a framework that encourages models not only to generate correct answers, but also to judge the solvability of questions in the multi-turn conversation setting. Our approach employs a competence-gated curriculum that incrementally increases dialogue difficulty (in terms of instruction shards), stabilizing training while promoting reliability. Using multi-turn, on-policy rollouts and a mixed-reward system, RLAAR teaches models to balance problem-solving with informed abstention, reducing premature answering behaviors that cause LiC. Evaluated on LiC benchmarks, RLAAR significantly mitigates LiC performance decay (62.6% to 75.1%) and improves calibrated abstention rates (33.5% to 73.4%). Together, these results provide a practical recipe for building multi-turn reliable and trustworthy LLMs.
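The competence-gated curriculum and mixed reward described in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the class and function names, the rolling-window gate, the 0.7 threshold, and the 0/1 reward values are all hypothetical.

```python
from collections import deque


class CompetenceGatedCurriculum:
    """Hypothetical gate: raise the number of instruction shards only after
    the recent average success rate reaches a threshold."""

    def __init__(self, max_shards, threshold=0.7, window=4):
        self.level = 1                      # current difficulty (shards per dialogue)
        self.max_shards = max_shards
        self.threshold = threshold          # assumed competence threshold
        self.recent = deque(maxlen=window)  # rolling window of success rates

    def update(self, success_rate):
        """Record one evaluation result and return the (possibly raised) level."""
        self.recent.append(success_rate)
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and self.level < self.max_shards:
            if sum(self.recent) / len(self.recent) >= self.threshold:
                self.level += 1
                self.recent.clear()         # re-measure competence at the new level
        return self.level


def mixed_reward(answer, gold, solvable):
    """Hypothetical verifiable reward: `answer is None` means the model abstained.

    Unsolvable turn: reward abstention. Solvable turn: reward exact correctness.
    """
    if not solvable:
        return 1.0 if answer is None else 0.0
    return 1.0 if answer == gold else 0.0
```

In this sketch, abstaining on a solvable question earns nothing, so the policy is pushed to answer only once enough shards have been revealed; a real reward would likely need softer matching than exact equality.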
2025
LongLeader: A Comprehensive Leaderboard for Large Language Models in Long-context Scenarios
Pei Chen | Hongye Jin | Cheng-Che Lee | Rulin Shao | Jingfeng Yang | Mingyu Zhao | Zhaoyu Zhang | Qin Lu | Kaiwen Men | Ning Xie | Huasheng Li | Bing Yin | Han Li | Lingyun Wang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Large Language Models (LLMs), exemplified by Claude and Llama, have exhibited impressive proficiency in tackling a myriad of Natural Language Processing (NLP) tasks. Yet, in pursuit of the ambitious goal of attaining Artificial General Intelligence (AGI), there remains ample room for enhancing LLM capabilities. Chief among these is the pressing need to bolster long-context comprehension. Numerous real-world scenarios require LLMs to reason adeptly across extended contexts, such as multi-turn dialogues or agent workflows. Hence, recent advancements have been dedicated to stretching the upper bounds of long-context comprehension, with models like Claude 3 accommodating up to 200k tokens, employing various techniques to achieve this feat. Aligned with this progression, we propose a leaderboard LongLeader that seeks to comprehensively assess different long-context comprehension abilities of diverse LLMs and context length extension strategies across meticulously selected benchmarks. Specifically, we aim to address the following questions: 1) Do LLMs genuinely deliver the long-context proficiency they purport? 2) Which benchmarks offer reliable metrics for evaluating long-context comprehension? 3) What technical strategies prove effective in extending the understanding of longer contexts? We streamline the evaluation process for LLMs on the benchmarks, offering open-source access to the benchmarks and maintaining a dedicated website for leaderboards. We will continuously curate new datasets and update models to the leaderboards.
IHEval: Evaluating Language Models on Following the Instruction Hierarchy
Zhihan Zhang | Shiyang Li | Zixuan Zhang | Xin Liu | Haoming Jiang | Xianfeng Tang | Yifan Gao | Zheng Li | Haodong Wang | Zhaoxuan Tan | Yichuan Li | Qingyu Yin | Bing Yin | Meng Jiang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
The instruction hierarchy, which establishes a priority order from system messages to user messages, conversation history, and tool outputs, is essential for ensuring consistent and safe behavior in language models (LMs). Despite its importance, this topic receives limited attention, and there is a lack of comprehensive benchmarks for evaluating models’ ability to follow the instruction hierarchy. We bridge this gap by introducing IHEval, a novel benchmark comprising 3,538 examples across nine tasks, covering cases where instructions in different priorities either align or conflict. Our evaluation of popular LMs highlights their struggle to recognize instruction priorities. All evaluated models experience a sharp performance decline when facing conflicting instructions, compared to their original instruction-following performance. Moreover, the most competitive open-source model only achieves 48% accuracy in resolving such conflicts. Our results underscore the need for targeted optimization in the future development of LMs.
Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training
Yuchen Zhuang | Jingfeng Yang | Haoming Jiang | Xin Liu | Kewei Cheng | Sanket Lokegaonkar | Yifan Gao | Qing Ping | Tianyi Liu | Binxuan Huang | Zheng Li | Zhengyang Wang | Pei Chen | Ruijie Wang | Rongzhi Zhang | Nasser Zalmout | Priyanka Nigam | Bing Yin | Chao Zhang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Due to the scarcity of agent-oriented pre-training data, LLM-based autonomous agents typically rely on complex prompting or extensive fine-tuning, which often fails to introduce new capabilities while preserving strong generalizability. We introduce Hephaestus-Forge, the first large-scale pre-training corpus designed to enhance the fundamental capabilities of LLM agents in API function calling, intrinsic reasoning and planning, and adapting to environmental feedback. Hephaestus-Forge comprises 103B agent-specific data encompassing 76,537 APIs, including both tool documentation to introduce knowledge of API functions and function calling trajectories to strengthen intrinsic reasoning. To explore effective training protocols, we investigate scaling laws to identify the optimal recipe in data mixing ratios. By continual pre-training on Hephaestus-Forge, Hephaestus outperforms small- to medium-scale open-source LLMs and rivals commercial LLMs on three agent benchmarks, demonstrating the effectiveness of our pre-training corpus in enhancing fundamental agentic capabilities and generalization of LLMs to new tasks or environments.
DrAgent: Empowering Large Language Models as Medical Agents for Multi-hop Medical Reasoning
Fenglin Liu | Zheng Li | Hongjian Zhou | Qingyu Yin | Jingfeng Yang | Xin Liu | Zhengyang Wang | Xianfeng Tang | Shiyang Li | Xiang He | Ruijie Wang | Bing Yin | Xiao Gu | Lei Clifton | David A. Clifton
Findings of the Association for Computational Linguistics: EMNLP 2025
Although large language models (LLMs) have been shown to outperform human experts in medical examinations, it remains challenging to adopt LLMs in real-world clinical decision-making that typically involves multi-hop medical reasoning. Common practices include prompting commercial LLMs and fine-tuning LLMs on medical data. However, in the clinical domain, using commercial LLMs raises privacy concerns regarding sensitive patient data. Fine-tuning competitive medical LLMs for different tasks usually requires extensive data and computing resources, which are difficult to acquire, especially in medical institutions with limited infrastructure. We propose DrAgent, which can build LLMs as agents to deliver accurate medical decision-making and reasoning. In implementation, we take a lightweight LLM as the backbone to collaborate with diverse clinical tools. To make efficient use of data, DrAgent introduces recursive curriculum learning to optimize the LLM in an easy-to-hard progression. The results show that our approach achieves competitive performance on diverse datasets.
Can Language Models Follow Multiple Turns of Entangled Instructions?
Chi Han | Xin Liu | Haodong Wang | Shiyang Li | Jingfeng Yang | Haoming Jiang | Zhengyang Wang | Qingyu Yin | Liang Qiu | Changlong Yu | Yifan Gao | Zheng Li | Bing Yin | Jingbo Shang | Heng Ji
Findings of the Association for Computational Linguistics: EMNLP 2025
Despite significant achievements in improving the instruction-following capabilities of large language models (LLMs), the ability to process multiple potentially entangled or conflicting instructions remains a considerable challenge. Real-world scenarios often require consistency across multiple instructions over time, such as secret privacy, personal preferences, and prioritization, which demands sophisticated abilities to integrate multiple turns and carefully balance competing objectives when instructions intersect or conflict. This work presents a systematic investigation of LLMs’ capabilities in handling multiple turns of instructions, covering three levels of difficulty: (1) retrieving information from instructions, (2) tracking and reasoning across turns, and (3) resolving conflicts among instructions. We construct MultiTurnInstruct with 1.1K high-quality multi-turn conversations through a human-in-the-loop approach, resulting in a total of nine capability categories, including statics and dynamics, reasoning, and multitasking. Our findings reveal an intriguing trade-off between different capabilities. While GPT models demonstrate superior memorization, they show reduced effectiveness in privacy-protection tasks requiring selective information withholding. Larger models exhibit stronger reasoning capabilities but still struggle with resolving conflicting instructions. Importantly, these performance gaps cannot be attributed solely to information loss, as models demonstrate strong BLEU scores on memorization tasks but their attention mechanisms fail to effectively integrate multiple related instructions. These findings highlight critical areas for improvement in complex real-world tasks involving multi-turn instructions.
DORM: Preference Data Weights Optimization for Reward Modeling in LLM Alignment
Rongzhi Zhang | Chenwei Zhang | Xinyang Zhang | Liang Qiu | Haoming Jiang | Yuchen Zhuang | Qingru Zhang | Hyokun Yun | Xian Li | Bing Yin | Tuo Zhao | Chao Zhang
Findings of the Association for Computational Linguistics: EMNLP 2025
Aligning large language models (LLMs) with human preferences relies heavily on high-quality reward models. However, existing approaches struggle with two critical challenges: noisy preference labels and the varying importance of preference samples. We introduce DORM, a method that enhances reward modeling by learning to dynamically weigh preference data. DORM initializes data importance using a combination of model uncertainty and prediction disagreement, then iteratively refines it via bilevel optimization to maximize validation performance. Using only 50k samples, DORM trains a 12B reward model that achieves 90.5% accuracy on RewardBench, matching the performance of models trained on significantly larger datasets. Furthermore, downstream alignment tasks show that fine-tuned LLMs with DORM achieve a 61.2% win rate against baseline methods, highlighting its data efficiency and generalizability.
WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning
Zhepei Wei | Wenlin Yao | Yao Liu | Weizhi Zhang | Qin Lu | Liang Qiu | Changlong Yu | Puyang Xu | Chao Zhang | Bing Yin | Hyokun Yun | Lihong Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
While reinforcement learning (RL) has demonstrated remarkable success in enhancing large language models (LLMs), it has primarily focused on single-turn tasks such as solving math problems. Training effective web agents for multi-turn interactions remains challenging due to the complexity of long-horizon decision-making across dynamic web interfaces. In this work, we present WebAgent-R1, a simple yet effective end-to-end multi-turn RL framework for training web agents. It learns directly from online interactions with web environments by asynchronously generating diverse trajectories, entirely guided by binary rewards depending on task success. Experiments on the WebArena-Lite benchmark demonstrate the effectiveness of WebAgent-R1, boosting the task success rate of Qwen-2.5-3B from 6.1% to 33.9% and LLaMA-3.1-8B from 8.5% to 44.8%, significantly outperforming existing state-of-the-art methods and strong proprietary models such as OpenAI o3. In-depth analyses reveal the effectiveness of the thinking-based prompting strategy and test-time scaling through increased interactions for web tasks. We further investigate different RL initialization policies by introducing two variants, namely WebAgent-R1-Zero and WebAgent-R1-CoT, which highlight the importance of the warm-up training stage (i.e., behavior cloning) and provide insights on incorporating long chain-of-thought (CoT) reasoning in web agents.
Improving Large Language Models Function Calling and Interpretability via Guided-Structured Templates
Hy Dang | Tianyi Liu | Zhuofeng Wu | Jingfeng Yang | Haoming Jiang | Tao Yang | Pei Chen | Zhengyang Wang | Helen Wang | Huasheng Li | Bing Yin | Meng Jiang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) have demonstrated strong reasoning and tool-use capabilities, yet they often fail in real-world tool interactions due to incorrect parameterization, poor tool selection, or misinterpretation of user intent. These issues often stem from an incomplete understanding of user goals and inadequate comprehension of tool documentation. While Chain-of-Thought (CoT) prompting has proven effective for enhancing reasoning in general contexts, our analysis reveals that free-form CoT is insufficient and sometimes counterproductive for structured function-calling tasks. To address this, we introduce a curriculum-inspired framework that leverages structured reasoning templates to guide LLMs through more deliberate step-by-step instructions for generating function calls. Experimental results show that our method reduces tool-use errors, achieving 3–12% relative improvements over strong baselines across diverse model series and approaches. Moreover, our framework enhances the robustness, interpretability, and transparency of tool-using agents, advancing the development of more reliable AI assistants for real-world applications.
AutoMixAlign: Adaptive Data Mixing for Multi-Task Preference Optimization in LLMs
Nicholas E. Corrado | Julian Katz-Samuels | Adithya M Devraj | Hyokun Yun | Chao Zhang | Yi Xu | Yi Pan | Bing Yin | Trishul Chilimbi
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
When aligning large language models (LLMs), their performance across various tasks (such as being helpful, harmless, and honest) is heavily influenced by the composition of the training data. However, it is difficult to determine what mixture of data should be used to produce a model with strong performance across all tasks. Existing approaches rely on large ablation studies, heuristics, or human intuition, though these can be prohibitively expensive and suboptimal. We study this problem in the context of preference optimization via DPO and propose a novel and theoretically justified algorithm, AutoMixAlign (AMA), that adaptively mixes datasets during LLM training to balance performance across multiple tasks. AMA first trains specialist models for each task to determine losses that correspond to strong task performance. Next, AMA trains a generalist model using a novel minimax optimization that prioritizes tasks for which generalist model losses are furthest from specialist model losses. We introduce two algorithms to optimize this problem: (1) AMA-R adaptively reweights the objective to prioritize tasks, and (2) AMA-S adaptively adjusts how much data is sampled from each task to prioritize tasks. Both algorithms achieve a convergence rate of O(1/√T) in the convex case. AMA-R’s convergence result immediately follows from Sagawa et al., 2019, and we provide a convergence proof for AMA-S using techniques from online learning such as EXP3 (Auer et al., 2002). We evaluate AMA on several multitask alignment setups, and observe that AMA outperforms the standard alignment approach which simply optimizes the total loss across all tasks and also outperforms model-merging methods.
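The reweighting idea in the abstract above (prioritize tasks whose generalist loss is furthest from the specialist loss) can be illustrated with a generic exponentiated-gradient update over task weights. This is a sketch under stated assumptions, not the AMA-R algorithm itself: the function names, the clipping of the excess loss at zero, and the step size are hypothetical choices made for illustration.

```python
import math


def update_task_weights(weights, generalist_losses, specialist_losses, step_size=1.0):
    """One exponentiated-gradient step on the probability simplex.

    Tasks with a larger excess loss (generalist minus specialist, clipped at
    zero; the clipping is an assumption) receive more weight.
    """
    excess = [max(g - s, 0.0) for g, s in zip(generalist_losses, specialist_losses)]
    unnormalized = [w * math.exp(step_size * e) for w, e in zip(weights, excess)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def reweighted_objective(weights, generalist_losses, specialist_losses):
    """Weighted sum of excess losses that the generalist model would minimize."""
    return sum(
        w * max(g - s, 0.0)
        for w, g, s in zip(weights, generalist_losses, specialist_losses)
    )


# Toy example: task 0 lags its specialist by 0.5, task 1 already matches it.
weights = update_task_weights([0.5, 0.5], [1.0, 0.6], [0.5, 0.6])
```

After the update in the toy example, task 0 carries more weight than task 1, so the next gradient step on the reweighted objective focuses on the lagging task.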
Aligning Large Language Models with Implicit Preferences from User-Generated Content
Zhaoxuan Tan | Zheng Li | Tianyi Liu | Haodong Wang | Hyokun Yun | Ming Zeng | Pei Chen | Zhihan Zhang | Yifan Gao | Ruijie Wang | Priyanka Nigam | Bing Yin | Meng Jiang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Learning from preference feedback is essential for aligning large language models (LLMs) with human values and improving the quality of generated responses. However, existing preference learning methods rely heavily on curated data from humans or advanced LLMs, which is costly and difficult to scale. In this work, we present PUGC, a novel framework that leverages implicit human Preferences in unlabeled User-Generated Content (UGC) to generate preference data. Although UGC is not explicitly created to guide LLMs in generating human-preferred responses, it often reflects valuable insights and implicit preferences from its creators that have the potential to address readers’ questions. PUGC transforms UGC into user queries and generates responses from the policy model. The UGC is then leveraged as a reference text for response scoring, aligning the model with these implicit preferences. This approach improves the quality of preference data while enabling scalable, domain-specific alignment. Experimental results on AlpacaEval 2 show that models trained with DPO and PUGC achieve a 9.37% performance improvement over traditional methods, setting a 35.93% state-of-the-art length-controlled win rate using Mistral-7B-Instruct. Further studies highlight gains in reward quality, domain-specific alignment effectiveness, robustness against UGC quality, and theory of mind capabilities. Our code and dataset are available at https://zhaoxuan.info/PUGC.github.io/.
UniConv: Unifying Retrieval and Response Generation for Large Language Models in Conversations
Fengran Mo | Yifan Gao | Chuan Meng | Xin Liu | Zhuofeng Wu | Kelong Mao | Zhengyang Wang | Pei Chen | Zheng Li | Xian Li | Bing Yin | Meng Jiang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The rapid advancement of conversational search systems revolutionizes how information is accessed by enabling multi-turn interaction between the user and the system. Existing conversational search systems are usually built with two different models. This separation prevents the system from leveraging the intrinsic knowledge of both models simultaneously and cannot ensure that effective retrieval benefits generation. Existing studies on developing unified models cannot fully address the aspects of understanding conversational context, managing retrieval independently, and generating responses. In this paper, we explore how to unify dense retrieval and response generation for large language models in conversation. We conduct joint fine-tuning with different objectives and design two mechanisms to reduce the inconsistency risks while mitigating data discrepancy. The evaluations on five conversational search datasets demonstrate that our unified model can mutually improve both tasks and outperform the existing baselines.
2023
FolkScope: Intention Knowledge Graph Construction for E-commerce Commonsense Discovery
Changlong Yu | Weiqi Wang | Xin Liu | Jiaxin Bai | Yangqiu Song | Zheng Li | Yifan Gao | Tianyu Cao | Bing Yin
Findings of the Association for Computational Linguistics: ACL 2023
Understanding users’ intentions in e-commerce platforms requires commonsense knowledge. In this paper, we present FolkScope, an intention knowledge graph construction framework, to reveal the structure of humans’ minds about purchasing items. As commonsense knowledge is usually ineffable and not expressed explicitly, it is challenging to perform information extraction. Thus, we propose a new approach that leverages the generation power of large language models (LLMs) and human-in-the-loop annotation to semi-automatically construct the knowledge graph. LLMs first generate intention assertions via e-commerce specific prompts to explain shopping behaviors, where the intention can be an open reason or a predicate falling into one of 18 categories aligning with ConceptNet, e.g., IsA, MadeOf, UsedFor, etc. Then we annotate plausibility and typicality labels of sampled intentions as training data in order to populate human judgments to all automatic generations. Last, to structurize the assertions, we propose pattern mining and conceptualization to form more condensed and abstract knowledge. Extensive evaluations and studies demonstrate that our constructed knowledge graph models e-commerce knowledge well and has many potential applications.
Graph Reasoning for Question Answering with Triplet Retrieval
Shiyang Li | Yifan Gao | Haoming Jiang | Qingyu Yin | Zheng Li | Xifeng Yan | Chao Zhang | Bing Yin
Findings of the Association for Computational Linguistics: ACL 2023
Answering complex questions often requires reasoning over knowledge graphs (KGs). State-of-the-art methods often utilize entities in questions to retrieve local subgraphs, which are then fed into a KG encoder, e.g., graph neural networks (GNNs), to model their local structures and integrated into language models for question answering. However, this paradigm constrains retrieved knowledge to local subgraphs and discards more diverse triplets buried in KGs that are disconnected but useful for question answering. In this paper, we propose a simple yet effective method to first retrieve the most relevant triplets from KGs and then rerank them, which are then concatenated with questions to be fed into language models. Extensive results on both CommonsenseQA and OpenbookQA datasets show that our method can outperform the state of the art by up to 4.6% absolute accuracy.
Multimodal Prompt Learning for Product Title Generation with Extremely Limited Labels
Bang Yang | Fenglin Liu | Zheng Li | Qingyu Yin | Chenyu You | Bing Yin | Yuexian Zou
Findings of the Association for Computational Linguistics: ACL 2023
Generating an informative and attractive title for the product is a crucial task for e-commerce. Most existing works follow the standard multimodal natural language generation approaches, e.g., image captioning, and employ large-scale human-labelled datasets to train desirable models. However, for novel products, especially in a different domain, there is little existing labelled data. In this paper, we propose a prompt-based approach, i.e., the Multimodal Prompt Learning framework, to accurately and efficiently generate titles for novel products with limited labels. We observe that the core challenges of novel product title generation are the understanding of novel product characteristics and the generation of titles in a novel writing style. To this end, we build a set of multimodal prompts from different modalities to preserve the corresponding characteristics and writing styles of novel products. As a result, with extremely limited labels for training, the proposed method can retrieve the multimodal prompts to generate desirable titles for novel products. The experiments and analyses are conducted on five novel product categories under both the in-domain and out-of-domain experimental settings. The results show that, with only 1% of downstream labelled data for training, our proposed approach achieves the best few-shot results and even achieves competitive results with fully-supervised methods trained on 100% of training data; with the full labelled data for training, our method achieves state-of-the-art results.
SCOTT: Self-Consistent Chain-of-Thought Distillation
Peifeng Wang | Zhengyang Wang | Zheng Li | Yifan Gao | Bing Yin | Xiang Ren
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LMs) beyond a certain scale demonstrate the emergent capability of generating free-text rationales for their predictions via chain-of-thought (CoT) prompting. While CoT can yield dramatically improved performance, such gains are only observed for sufficiently large LMs. Even more concerning, there is little guarantee that the generated rationales are consistent with the LM’s predictions or faithfully justify the decisions. In this work, we propose SCOTT, a faithful knowledge distillation method to learn a small, self-consistent CoT model from a teacher model that is orders of magnitude larger. To form better supervision, we elicit rationales supporting the gold answers from a large LM (teacher) by contrastive decoding, which encourages the teacher to generate tokens that become more plausible only when the answer is considered. To ensure faithful distillation, we use the teacher-generated rationales to learn a student LM with a counterfactual reasoning objective, which prevents the student from ignoring the rationales to make inconsistent predictions. Experiments show that while yielding comparable performance, our method leads to a more faithful model than baselines. Further analysis shows that such a model respects the rationales more when making decisions; thus, we can improve its performance more by refining its rationales.
Co-authors
- Zheng Li 10
- Yifan Gao 8
- Pei Chen 6
- Haoming Jiang 6
- Xin Liu 6
- Meng Jiang 5
- Zhengyang Wang 5
- Jingfeng Yang 5
- Qingyu Yin 5
- Chao Zhang 5
- Shiyang Li 4
- Hyokun Yun 4
- Huasheng Li 3
- Tianyi Liu 3
- Liang Qiu 3
- Haodong Wang 3
- Ruijie Wang 3
- Zhuofeng Wu 3
- Changlong Yu 3
- Tianyu Cao 2
- Han Li 2
- Lihong Li 2
- Xian Li 2
- Fenglin Liu 2
- Qin Lu 2
- Priyanka Nigam 2
- Zhaoxuan Tan 2
- Xianfeng Tang 2
- Tao Yang 2
- Ming Zeng 2
- Rongzhi Zhang 2
- Xinyang Zhang 2
- Zhihan Zhang 2
- Yuchen Zhuang 2
- Jiaxin Bai 1
- Kewei Cheng 1
- Trishul Chilimbi 1
- David A. Clifton 1
- Lei Clifton 1
- Nicholas E. Corrado 1
- Hy Dang 1
- Adithya M Devraj 1
- Xiao Gu 1
- Chi Han 1
- Xiang He 1
- Binxuan Huang 1
- Heng Ji 1
- Hongye Jin 1
- Julian Katz-Samuels 1
- Cheng-Che Lee 1
- Ming Li 1
- Yichuan Li 1
- Yao Liu 1
- Sanket Lokegaonkar 1
- Kelong Mao 1
- Kaiwen Men 1
- Chuan Meng 1
- Fengran Mo 1
- Yi Pan 1
- Qing Ping 1
- Xiang Ren 1
- Jingbo Shang 1
- Rulin Shao 1
- Yangqiu Song 1
- Helen Wang 1
- Lingyun Wang 1
- Peifeng Wang 1
- Weiqi Wang 1
- Zhengyang Wang 1
- Zhepei Wei 1
- Ning Xie 1
- Puyang Xu 1
- Yi Xu 1
- Xifeng Yan 1
- Bang Yang 1
- Wenlin Yao 1
- Chenyu You 1
- Nasser Zalmout 1
- Chenwei Zhang 1
- Qingru Zhang 1
- Weizhi Zhang 1
- Zhaoyu Zhang 1
- Zhenhao Zhang 1
- Zixuan Zhang 1
- Mingyu Zhao 1
- Tuo Zhao 1
- Hongjian Zhou 1
- Yuexian Zou 1