2025
Velocitune: A Velocity-based Dynamic Domain Reweighting Method for Continual Pre-training
Zheheng Luo
|
Xin Zhang
|
Xiao Liu
|
Haoling Li
|
Yeyun Gong
|
Qi Chen
|
Peng Cheng
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
It is well-known that a diverse corpus is critical for training large language models, which are typically constructed from a mixture of various domains. In general, previous efforts resort to either sampling training data from different domains with static proportions or dynamically adjusting these proportions during training to optimise pretraining performance. However, few methods have addressed the complexities of domain-adaptive continual pre-training. To fill this gap, we propose Velocitune, a novel framework that dynamically assesses learning velocity and adjusts data proportions accordingly, favouring slower-learning domains while de-emphasising faster-learning ones. Velocitune is guided by a scaling law that estimates the desired learning target for each domain at a lower associated cost. To evaluate the effectiveness of Velocitune, we conduct experiments on a dataset focused on reasoning tasks with CodeLlama, as well as on a corpus of system commands using Llama3 and Mistral. Velocitune achieves performance gains in both math and code reasoning tasks and command-line generation benchmarks. Further analysis reveals that key factors driving Velocitune’s effectiveness include target estimation and data ordering.
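The core idea, reweighting domains by how fast each one is learning relative to an estimated target loss, can be illustrated with a minimal stdlib-only sketch. This is a toy rendition of the general mechanism, not the paper's exact algorithm, and the function name `reweight` and its parameters are invented here for illustration:

```python
import math

def reweight(cur_loss, init_loss, target_loss, temp=1.0):
    """Toy velocity-based domain reweighting.

    For each domain, progress is the fraction of the gap between the
    initial loss and an estimated target loss that has been closed so
    far. Slower-learning domains (low progress) receive higher sampling
    weight via a softmax over the remaining gap.
    """
    remaining = []
    for d in cur_loss:
        gap = init_loss[d] - target_loss[d]          # total loss to shed
        closed = init_loss[d] - cur_loss[d]          # loss shed so far
        progress = closed / gap if gap > 0 else 1.0  # learning-velocity proxy
        remaining.append((d, 1.0 - progress))        # slower => larger share
    z = [math.exp(r / temp) for _, r in remaining]
    s = sum(z)
    return {d: w / s for (d, _), w in zip(remaining, z)}
```

For example, a domain whose loss has barely moved toward its target ends up with a larger sampling proportion than one that is nearly converged.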
ChartMind: A Comprehensive Benchmark for Complex Real-world Multimodal Chart Question Answering
Jingxuan Wei
|
Nan Xu
|
Junnan Zhu
|
Haoyanni
|
Gaowei Wu
|
Qi Chen
|
Bihui Yu
|
Lei Wang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Chart question answering (CQA) has become a critical multimodal task for evaluating the reasoning capabilities of vision-language models. While early approaches have shown promising performance by focusing on visual features or leveraging large-scale pre-training, most existing evaluations rely on rigid output formats and objective metrics, thus ignoring the complex, real-world demands of practical chart analysis. In this paper, we introduce ChartMind, a new benchmark designed for complex CQA tasks in real-world settings. ChartMind covers seven task categories, incorporates multilingual contexts, supports open-domain textual outputs, and accommodates diverse chart formats, bridging the gap between real-world applications and traditional academic benchmarks. Furthermore, we propose a context-aware yet model-agnostic framework, ChartLLM, that focuses on extracting key contextual elements, reducing noise, and enhancing the reasoning accuracy of multimodal large language models. Extensive evaluations on ChartMind and three representative public benchmarks with 14 mainstream multimodal models show our framework significantly outperforms the previous three common CQA paradigms: instruction-following, OCR-enhanced, and chain-of-thought, highlighting the importance of flexible chart understanding for real-world CQA. These findings suggest new directions for developing more robust chart reasoning in future research.
InMind: Evaluating LLMs in Capturing and Applying Individual Human Reasoning Styles
Zizhen Li
|
Chuanhao Li
|
Yibin Wang
|
Qi Chen
|
Diping Song
|
Yukang Feng
|
Jianwen Sun
|
Jiaxin Ai
|
Fanrui Zhang
|
Mingzhu Sun
|
Kaipeng Zhang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
LLMs have shown strong performance on human-centric reasoning tasks. While previous evaluations have explored whether LLMs can infer intentions or detect deception, they often overlook the individualized reasoning styles that influence how people interpret and act in social contexts. Social deduction games (SDGs) provide a natural testbed for evaluating individualized reasoning styles, where different players may adopt diverse but contextually valid reasoning strategies under identical conditions. To address this, we introduce InMind, a cognitively grounded evaluation framework designed to assess whether LLMs can capture and apply personalized reasoning styles in SDGs. InMind enhances structured gameplay data with round-level strategy traces and post-game reflections, collected under both Observer and Participant modes. It supports four cognitively motivated tasks that jointly evaluate both static alignment and dynamic adaptation. As a case study, we apply InMind to the game Avalon, evaluating 11 state-of-the-art LLMs. General-purpose LLMs, even GPT-4o, frequently rely on lexical cues and struggle to anchor reflections in temporal gameplay or adapt to evolving strategies. In contrast, reasoning-enhanced LLMs like DeepSeek-R1 exhibit early signs of style-sensitive reasoning. These findings reveal key limitations in current LLMs’ capacity for individualized, adaptive reasoning, and position InMind as a step toward cognitively aligned human–AI interaction.
Lost in Pronunciation: Detecting Chinese Offensive Language Disguised by Phonetic Cloaking Replacement
Haotan Guo
|
Jianfei He
|
Jiayuan Ma
|
Hongbin Na
|
Zimu Wang
|
Haiyang Zhang
|
Qi Chen
|
Wei Wang
|
Zijing Shi
|
Tao Shen
|
Ling Chen
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Phonetic Cloaking Replacement (PCR), defined as the deliberate use of homophonic or near-homophonic variants to hide toxic intent, has become a major obstacle to Chinese content moderation. While this problem is well-recognized, existing evaluations predominantly rely on rule-based, synthetic perturbations that ignore the creativity of real users. We organize PCR into a four-way surface-form taxonomy and compile PCR-ToxiCN, a dataset of 500 naturally occurring, phonetically cloaked offensive posts gathered from the RedNote platform. Benchmarking state-of-the-art LLMs on this dataset exposes a serious weakness: the best model reaches only an F1-score of 0.672, and zero-shot chain-of-thought prompting pushes performance even lower. Guided by error analysis, we revisit a Pinyin-based prompting strategy that earlier studies judged ineffective and show that it recovers much of the lost accuracy. This study offers the first comprehensive taxonomy of Chinese PCR, a realistic benchmark that reveals current detectors’ limits, and a lightweight mitigation technique that advances research on robust toxicity detection.
FinDebate: Multi-Agent Collaborative Intelligence for Financial Analysis
Tianshi Cai
|
Guanxu Li
|
Nijia Han
|
Ce Huang
|
Zimu Wang
|
Changyu Zeng
|
Yuqi Wang
|
Jingshi Zhou
|
Haiyang Zhang
|
Qi Chen
|
Yushan Pan
|
Shuihua Wang
|
Wei Wang
Proceedings of The 10th Workshop on Financial Technology and Natural Language Processing
2024
Exploring Faithful and Informative Commonsense Reasoning and Moral Understanding in Children’s Stories
Zimu Wang
|
Yuqi Wang
|
Nijia Han
|
Qi Chen
|
Haiyang Zhang
|
Yushan Pan
|
Qiufeng Wang
|
Wei Wang
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 3: Evaluations)
Commonsense reasoning and moral understanding are crucial tasks in artificial intelligence (AI) and natural language processing (NLP). However, existing research often falls short in terms of faithfulness and informativeness during the reasoning process. We propose a novel framework for performing commonsense reasoning and moral understanding using large language models (LLMs), involving constructing guided prompts by incorporating relevant knowledge for commonsense reasoning and extracting facts from stories for moral understanding. We conduct extensive experiments on the Commonsense Reasoning and Moral Understanding in Children’s Stories (CRMUS) dataset with widely recognised LLMs under both zero-shot and fine-tuning settings, demonstrating the effectiveness of our proposed method. Furthermore, we analyse the adaptability of different LLMs in extracting facts for moral understanding performance.
DKE-Research at SemEval-2024 Task 2: Incorporating Data Augmentation with Generative Models and Biomedical Knowledge to Enhance Inference Robustness
Yuqi Wang
|
Zeqiang Wang
|
Wei Wang
|
Qi Chen
|
Kaizhu Huang
|
Anh Nguyen
|
Suparna De
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
Safe and reliable natural language inference is critical for extracting insights from clinical trial reports but poses challenges due to biases in large pre-trained language models. This paper presents a novel data augmentation technique to improve model robustness for biomedical natural language inference in clinical trials. By generating synthetic examples through semantic perturbations and domain-specific vocabulary replacement and adding a new task for numerical and quantitative reasoning, we introduce greater diversity and reduce shortcut learning. Our approach, combined with multi-task learning and the DeBERTa architecture, achieved significant performance gains on the NLI4CT 2024 benchmark compared to the original language models. Ablation studies validate the contribution of each augmentation method in improving robustness. Our best-performing model ranked 12th in faithfulness and 8th in consistency out of the 32 participants.
Knowledge Distillation from Monolingual to Multilingual Models for Intelligent and Interpretable Multilingual Emotion Detection
Yuqi Wang
|
Zimu Wang
|
Nijia Han
|
Wei Wang
|
Qi Chen
|
Haiyang Zhang
|
Yushan Pan
|
Anh Nguyen
Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis
Emotion detection from text is a crucial task in understanding natural language with wide-ranging applications. Existing approaches for multilingual emotion detection from text face challenges with data scarcity across many languages and a lack of interpretability. We propose a novel method that leverages both monolingual and multilingual pre-trained language models to improve performance and interpretability. Our approach involves 1) training a high-performing English monolingual model in parallel with a multilingual model and 2) using knowledge distillation to transfer the emotion detection capabilities from the monolingual teacher to the multilingual student model. Experiments on a multilingual dataset demonstrate significant performance gains for refined multilingual models like XLM-RoBERTa and E5 after distillation. Furthermore, our approach enhances interpretability by enabling better identification of emotion-trigger words. Our work presents a promising direction for building accurate, robust and explainable multilingual emotion detection systems.
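The teacher-to-student transfer described above typically relies on a soft-label distillation loss that matches the student's temperature-softened output distribution to the teacher's. A minimal stdlib-only sketch of that standard loss term (Hinton-style distillation, not this paper's exact training objective; the function names are illustrative):

```python
import math

def softmax(logits, temp=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(x / temp) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, temp=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    the standard soft-label term in knowledge distillation."""
    p = softmax(teacher_logits, temp)  # teacher's softened distribution
    q = softmax(student_logits, temp)  # student's softened distribution
    # temp**2 rescales gradient magnitudes, as is conventional
    return (temp ** 2) * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

The loss is zero when the student exactly reproduces the teacher's distribution and grows as the two diverge, which is what drives the monolingual-to-multilingual transfer.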
2023
Prompt-based Zero-shot Text Classification with Conceptual Knowledge
Yuqi Wang
|
Wei Wang
|
Qi Chen
|
Kaizhu Huang
|
Anh Nguyen
|
Suparna De
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
In recent years, pre-trained language models have garnered significant attention due to their effectiveness, which stems from the rich knowledge acquired during pre-training. To mitigate the inconsistency issues between pre-training tasks and downstream tasks and to facilitate the resolution of language-related issues, prompt-based approaches have been introduced, which are particularly useful in low-resource scenarios. However, existing approaches mostly rely on verbalizers to translate the predicted vocabulary to task-specific labels. The major limitations of this approach are that it ignores potentially relevant domain-specific words and is biased by the pre-training data. To address these limitations, we propose a framework that incorporates conceptual knowledge for text classification in the extreme zero-shot setting. The framework includes prompt-based keyword extraction, weight assignment to each prompt keyword, and final representation estimation in the knowledge graph embedding space. We evaluated the method on four widely-used datasets for sentiment analysis and topic detection, demonstrating that it consistently outperforms recently-developed prompt-based approaches in the same experimental settings.
2020
MonaLog: a Lightweight System for Natural Language Inference Based on Monotonicity
Hai Hu
|
Qi Chen
|
Kyle Richardson
|
Atreyee Mukherjee
|
Lawrence S. Moss
|
Sandra Kuebler
Proceedings of the Society for Computation in Linguistics 2020
2019
Natural Language Inference with Monotonicity
Hai Hu
|
Qi Chen
|
Larry Moss
Proceedings of the 13th International Conference on Computational Semantics - Short Papers
This paper describes a working system which performs natural language inference using polarity-marked parse trees. The system handles all of the instances of monotonicity inference in the FraCaS data set. Except for the initial parse, it is entirely deterministic. It handles multi-premise arguments, and the kind of inference performed is essentially “logical”, but it goes beyond what is representable in first-order logic. In any case, the system works on surface forms rather than on representations of any kind.
2018
Auto-Dialabel: Labeling Dialogue Data with Unsupervised Learning
Chen Shi
|
Qi Chen
|
Lei Sha
|
Sujian Li
|
Xu Sun
|
Houfeng Wang
|
Lintao Zhang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
The lack of labeled data is one of the main challenges when building a task-oriented dialogue system. Existing dialogue datasets usually rely on human labeling, which is expensive, limited in size, and low in coverage. In this paper, we instead propose our framework auto-dialabel to automatically cluster the dialogue intents and slots. In this framework, we collect a set of context features, leverage an autoencoder for feature assembly, and adapt a dynamic hierarchical clustering method for intent and slot labeling. Experimental results show that our framework can reduce human labeling cost to a great extent, achieve good intent clustering accuracy (84.1%), and provide reasonable and instructive slot labeling results.
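The clustering step in the pipeline above (context features, autoencoder assembly, then hierarchical clustering of intents) can be illustrated with a generic single-linkage agglomerative routine that merges clusters until the closest pair exceeds a distance threshold. This is an illustrative stand-in, not the paper's dynamic hierarchical clustering method, and `agglomerative` and `threshold` are names invented here:

```python
import math

def _dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(points, threshold):
    """Single-linkage agglomerative clustering: repeatedly merge the
    two closest clusters until the closest pair is farther apart than
    `threshold`. Returns clusters as lists of point indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between closest members
                d = min(_dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # remaining clusters are well separated
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

On utterance embeddings, points that encode similar intents would fall into the same cluster, giving candidate intent groups without any human labels; the threshold plays the role the dynamic criterion serves in the paper.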
Detecting Free Translation in Parallel Corpora from Attention Scores
Qi Chen
|
Oi Yee Kwong
|
Jingbo Zhu
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation